Efficient use of the GPU. Parallel computing on GPUs, or a supercomputer in every home


Features of the AMD/ATI Radeon architecture

The emergence of GPU computing resembles the birth of new biological species: as living beings colonize new habitats, they evolve to adapt better to their environment. In the same way the GPU, having started with accelerating the rasterization and texturing of triangles, developed additional capabilities for executing the shader programs that color those same triangles. These capabilities turned out to be in demand for non-graphical computing as well, where in some cases they provide significant performance gains over traditional solutions.

To draw the analogy further: after a long evolution on land, mammals returned to the sea, where they displaced the native marine inhabitants. In that competition they relied both on the advanced abilities they had acquired on land and on those developed specifically for life in water. Likewise, GPUs, building on the architectural strengths developed for 3D graphics, keep adding specialized functionality that is useful for non-graphics tasks.

So what allows GPUs to claim their own sector in the general-purpose software space? The GPU microarchitecture is built very differently from that of a conventional CPU, and it carries certain inherent advantages. Graphics workloads involve independent parallel processing, so the GPU is natively multi-threaded. For it, this parallelism is pure benefit: the microarchitecture is designed to exploit the large number of threads available for execution.

The GPU consists of several dozen processor cores (30 for the Nvidia GT200, 20 for Evergreen, 16 for Fermi), which are called Streaming Multiprocessors in Nvidia terminology and SIMD Engines in ATI terminology. For the purposes of this article we will call them miniprocessors, because each of them executes several hundred program threads and can do almost everything a regular CPU can, but still not everything.

Marketing names are confusing: for greater effect they cite the number of functional units that can multiply and add, for example 320 vector "cores". Such "cores" are closer to grains than to full cores. It is better to think of the GPU as a kind of multi-core processor with a large number of cores that execute many threads simultaneously.

Each miniprocessor has local memory: 16 KB on GT200, 32 KB on Evergreen and 64 KB on Fermi (essentially a programmable L1 cache). Its access time is similar to that of the first-level cache of a conventional CPU, and it serves a similar purpose: delivering data to the functional units as quickly as possible. In the Fermi architecture part of the local memory can be configured as a regular cache. In a GPU, local memory is used for fast data exchange between executing threads. One of the usual schemes of a GPU program is as follows: first, data from the GPU's global memory is loaded into local memory. Global memory is just ordinary video memory, located (like system memory) separately from "its" processor; in the case of a video card it consists of several chips soldered onto the PCB. Next, several hundred threads work with this data in local memory and write the result to global memory, after which it is transferred to the CPU. It is the programmer's responsibility to write the instructions that load and unload data from local memory; essentially, this is partitioning the data of a specific task for parallel processing. The GPU also supports atomic read/write instructions to memory, but they are slow and are usually needed only at the final stage, to "glue together" the results computed by all the miniprocessors.
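
As an illustration of this scheme, here is a minimal CUDA sketch (the kernel, its name and the scaling operation are made up for the example): a block stages a tile of global memory in fast on-chip shared memory, synchronizes, computes on it and writes the result back to global memory, while the host code handles the transfers between CPU and GPU memory.

#include <cuda_runtime.h>

// Hypothetical example: each block copies a tile from global memory into
// fast on-chip shared memory, all threads of the block work on it, and the
// result is written back to global memory.
__global__ void scaleWithSharedMemory(const float* in, float* out, int n, float factor)
{
    __shared__ float tile[256];               // per-block "local" memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];            // global -> shared
    __syncthreads();                          // wait until the whole tile is loaded

    if (i < n)
        out[i] = tile[threadIdx.x] * factor;  // compute on fast memory, write back to global
}

// Host side: allocate device buffers, copy data in, launch, copy the result out.
void runExample(const float* hostIn, float* hostOut, int n)
{
    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid  = (n + block - 1) / block;
    scaleWithSharedMemory<<<grid, block>>>(dIn, dOut, n, 2.0f);

    cudaMemcpy(hostOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}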

Local memory is shared by all threads executing in the miniprocessor, which is why in Nvidia terminology it is even called shared memory, while the term local memory denotes the exact opposite: a private area of an individual thread in global memory, visible and accessible only to that thread. But besides local memory, the miniprocessor has another memory area, which in all architectures is roughly four times larger. It is divided equally between all executing threads; these are the registers for storing variables and intermediate results of calculations. Each thread gets several dozen registers. The exact number depends on how many threads the miniprocessor is running, and this number is very important, since the latency of global memory is very high, hundreds of cycles, and in the absence of caches there is nowhere else to keep intermediate results.

And one more important feature of the GPU: "soft" vectorization. Each miniprocessor has a large number of compute units (8 for GT200, 16 for Radeon and 32 for Fermi), but all of them can execute only the same instruction, at the same program address. The operands, however, can differ: each thread has its own. For example, take an instruction that adds the contents of two registers: it is executed simultaneously by all the compute units, but each takes its registers from a different thread. It is assumed that all threads of a GPU program, performing parallel data processing, generally move through the program code in lockstep. In that case all compute units are loaded evenly. But if threads diverge in their execution paths because of branches in the program, so-called serialization occurs. Then not all compute units are used, because the threads submit different instructions for execution, and a block of compute units can execute, as we have already said, only an instruction at one address. And, of course, performance drops relative to the maximum.

The advantage is that this vectorization is completely automatic; it is not programming with SSE or MMX intrinsics, and the GPU itself handles the divergences. In theory you can write GPU programs without thinking about the vector nature of the execution units at all, but the speed of such a program will not be very high. The downside is the large width of the vector. It is larger than the nominal number of functional units: 32 for Nvidia GPUs and 64 for Radeon. Threads are processed in blocks of the corresponding size. Nvidia calls such a block of threads a warp, AMD calls it a wave front; it is the same thing. Thus, on 16 compute units a "wave front" of 64 threads is processed in four clock cycles (assuming the usual instruction length). The author prefers the term warp in this case, due to the association with the nautical term warp, a rope made of twisted strands. So the threads "twist" into a solid bundle. But "wave front" also has a maritime association: instructions arrive at the execution units just as waves roll onto the shore one after another.

If all threads have advanced equally far through the program (are located in the same place) and are thus executing the same instruction, everything is fine; if not, a slowdown occurs. In that case threads from one warp or wave front sit at different places in the program; they are divided into groups of threads that have the same instruction pointer value. At any moment only the threads of one group are executed: all perform the same instruction, but with different operands. As a result, the warp runs as many times slower as the number of groups it is divided into, and the number of threads in a group does not matter. Even if a group consists of only one thread, it will still take as long to execute as a full warp. In hardware this is implemented by masking off certain threads: instructions are formally executed, but their results are not written anywhere and are not used afterwards.
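
A small CUDA-style sketch of such serialization (the condition and the arithmetic are arbitrary examples): threads of one warp whose data differ take different paths, and the hardware runs both paths one after another with the non-participating threads masked off.

__global__ void divergentKernel(const int* data, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads of one warp whose data differ take different paths here.
    // The warp executes BOTH branches one after another; in each branch the
    // threads that did not take it are masked, and their results are discarded.
    if (data[i] % 2 == 0)
        out[i] = data[i] * 2;   // executed first, threads with odd data masked
    else
        out[i] = data[i] + 1;   // executed second, threads with even data masked
}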

Although at any given time each miniprocessor (Streaming MultiProcessor or SIMD Engine) executes instructions belonging to only one warp (a bunch of threads), it has several dozen active warps in the execution pool. Having executed the instructions of one warp, the miniprocessor executes not the next instruction of the threads of this warp, but the instructions of some other warp. That warp can be in a completely different place in the program, this will not affect the speed, since only inside the warp the instructions of all threads must be the same for execution at full speed.

Each of the 20 SIMD Engines has four active wave fronts, each with 64 threads; each thread is represented by a short line. Total: 64 × 4 × 20 = 5120 threads.

Thus, given that each warp or wave front consists of 32-64 threads, the miniprocessor has several hundred active threads that execute almost simultaneously. Below we will see what architectural benefits such a large number of parallel threads promises, but first we will consider what limitations the miniprocessors that make up the GPU have.

The main one is that the GPU has no stack in which function parameters and local variables could be stored. Because of the huge number of threads, there is simply no room on the chip for a stack. Indeed, since the GPU executes about 10,000 threads simultaneously, with a per-thread stack of 100 KB the total volume would be 1 GB, equal to the standard amount of all video memory. Moreover, there is no way to place a stack of any significant size in the GPU core itself. For example, with 1000 bytes of stack per thread, a single miniprocessor would already require 1 MB of memory, which is almost five times the combined amount of the miniprocessor's local memory and the memory allocated for registers.

Therefore there is no recursion in a GPU program, and there is not much room for function calls either: all functions are inlined directly into the code when the program is compiled. This limits the scope of GPU applications to computational tasks. It is sometimes possible to emulate a limited stack in global memory for recursive algorithms with a known, small recursion depth, but this is not a typical GPU application. To do this, one has to specially design the algorithm and explore the feasibility of implementing it, with no guarantee of a successful speedup over the CPU.

Fermi introduced the ability to use virtual functions for the first time, but again their use is limited by the lack of a large, fast cache for each thread. 1536 threads share 48 KB or 16 KB of L1, so virtual functions can be used in a program only relatively rarely; otherwise the stack will also have to use slow global memory, which will slow down execution and most likely bring no benefit compared to the CPU version.

Thus, the GPU appears as a computing coprocessor: data is loaded into it, processed by some algorithm, and a result is produced.

Architecture benefits

On the other hand, the GPU computes very quickly, and its extreme multithreading helps it do so. A large number of active threads makes it possible to partially hide the high latency of the separately located global video memory, which is about 500 clock cycles. The latency is hidden especially well in code with a high density of arithmetic operations. Thus, the transistor-expensive L1-L2-L3 cache hierarchy is not required. Instead, many compute units can be placed on the chip, providing outstanding arithmetic performance. While the instructions of one thread or warp are being executed, the remaining hundreds of threads quietly wait for their data.

Fermi introduced an L2 cache of about 1 MB, but it cannot be compared with the caches of modern CPUs: it is intended rather for communication between cores and various software tricks. If its size were divided among all the tens of thousands of threads, each would get a negligible amount.

But global memory latency is not the only latency in a computing device that needs to be hidden. There is also the latency of on-chip data transfer from the compute units to the first-level cache, that is, to the GPU's local memory and to the registers, as well as to the instruction cache. The register file and the local memory are located apart from the functional units, and the access latency to them is roughly a dozen and a half cycles. Again, the large number of threads, the active warps, effectively hides this latency. Moreover, the total bandwidth to the local memory of the whole GPU, taking into account the number of miniprocessors that make it up, is considerably higher than the bandwidth to the first-level cache of modern CPUs. The GPU can process significantly more data per unit of time.

We can say right away that if the GPU is not supplied with a large number of parallel threads, its performance will be close to zero: it will work at the same pace as when fully loaded but do far less work. For example, suppose there is only one thread instead of 10,000: performance will drop by roughly a thousand times, because not only will most units sit idle, but all the latencies will also make themselves felt.

The problem of hiding latencies is also acute for modern high-frequency CPUs; sophisticated methods are used to eliminate it - deep pipelining, out-of-order execution of instructions. This requires complex instruction schedulers, various buffers, etc., which takes up space on the chip. This is all required for best single-threaded performance.

None of this is needed for the GPU: for computing tasks with a large number of threads it is fast by its very architecture, transforming multithreading into performance the way the philosopher's stone turns lead into gold.

The GPU was originally designed for optimal execution of shader programs for the pixels of triangles, which are obviously independent and can be executed in parallel. From that state it has evolved, by adding various capabilities (local memory, addressable access to video memory, a more complex instruction set), into a very powerful computing device, which can still be used effectively only for algorithms that allow a highly parallel implementation using a limited amount of local memory.

Example

One of the classic problems for the GPU is the problem of the interaction of N bodies creating a gravitational field. But if we, for example, need to compute the evolution of the Earth-Moon-Sun system, the GPU is a poor helper: there are too few objects. For each object one must compute the interactions with all the others, and there are only two of them. In the case of the motion of the Solar System with all the planets and their moons (a couple of hundred objects), the GPU is still not very efficient. A multi-core CPU, however, due to the high overhead of thread management, will also not be able to show its full power and will run in single-threaded mode. But if you also need to compute the trajectories of comets and asteroid-belt objects, then this is already a task for the GPU, since there are enough objects to create the required number of parallel computation threads.

The GPU will also perform well if you need to calculate the collision of globular clusters of hundreds of thousands of stars.

Another opportunity to use GPU power in the N-body problem arises when you need to compute many individual problems, even if each has a small number of bodies, for example, when computing variants of the evolution of one system for different sets of initial velocities. Then you can use the GPU effectively without any problems.
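
For illustration, here is a simplified CUDA sketch of the all-pairs gravitational step (the softening constant, G = 1 and the data layout are assumptions of the example). One thread is assigned to one body, which is exactly why the GPU only pays off when there are thousands of bodies:

#include <cuda_runtime.h>

#define EPS 1e-9f   // softening constant to avoid division by zero (an assumption of the sketch)

// One thread computes the acceleration of one body from all N others (G is taken as 1).
__global__ void nbodyAcceleration(const float3* pos, const float* mass,
                                  float3* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float distSqr = dx * dx + dy * dy + dz * dz + EPS;
        float invDist = rsqrtf(distSqr);                  // 1 / r
        float s = mass[j] * invDist * invDist * invDist;  // m_j / r^3
        a.x += dx * s;  a.y += dy * s;  a.z += dz * s;
    }
    acc[i] = a;
}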

AMD Radeon microarchitecture details

We have looked at the basic principles of GPU organization; they are common to video accelerators from all manufacturers, since they originally had one target task: shader programs. Nevertheless, manufacturers have found room to differ in the details of the microarchitectural implementation, just as CPUs from different vendors are sometimes very different internally even when compatible, such as Pentium 4 and Athlon or Core. The Nvidia architecture is already quite widely known; now we will look at Radeon and highlight the main differences in the approaches of these vendors.

AMD video cards received full support for general-purpose computing starting with the Evergreen family, which also implemented DirectX 11 specifications for the first time. Cards from the 47xx family have a number of significant limitations, which will be discussed below.

The differences in the size of local memory (32 KB for Radeon versus 16 KB for GT200 and 64 KB for Fermi) are generally not significant, nor is the wave front size of 64 threads for AMD versus 32 threads per warp for Nvidia. Almost any GPU program can easily be reconfigured and tuned to these parameters. Performance can change by tens of percent, but for a GPU this is not so important, because a GPU program usually either runs ten times faster than its CPU counterpart, or ten times slower, or does not work at all.

More important is AMD's use of VLIW (Very Long Instruction Word) technology. Nvidia uses simple scalar instructions operating on scalar registers; its accelerators implement simple classical RISC. AMD video cards have the same number of registers as GT200, but the registers are 128-bit vectors. Each VLIW instruction operates on several four-component 32-bit registers, which resembles SSE, but VLIW has far more capabilities. This is not SIMD (Single Instruction Multiple Data) like SSE: here the operations on each pair of operands can be different and even dependent! For example, let the components of register A be called a1, a2, a3, a4, and similarly for register B. A single instruction executing in one clock cycle can compute, for example, the number a1×b1+a2×b2+a3×b3+a4×b4, or the two-dimensional vector (a1×b1+a2×b2, a3×b3+a4×b4).

This was made possible by the lower clock frequency of the GPU compared to the CPU and the strong reduction in process technology in recent years. No scheduler is required here: almost everything executes in one clock cycle.

Thanks to vector instructions, Radeon's peak single-precision performance is very high, reaching teraflops.

One vector register can store one double-precision number instead of four single-precision ones. And one VLIW instruction can either add two pairs of double numbers, or multiply two numbers, or multiply two numbers and add a third. Thus, peak performance in double is about five times lower than in float. For the top Radeon models it matches the double performance of Nvidia Tesla on the new Fermi architecture and is much higher than the double performance of GT200-based cards. In consumer Fermi-based Geforce video cards, the maximum speed of double calculations was reduced by a factor of four.


Schematic diagram of Radeon operation. Only one miniprocessor out of 20 running in parallel is presented

GPU manufacturers, unlike CPU manufacturers (above all the x86-compatible ones), are not bound by compatibility issues. A GPU program is first compiled into an intermediate code, and when the program runs, the driver compiles this code into machine instructions specific to the particular model. As described above, GPU manufacturers have taken advantage of this, inventing convenient ISAs (Instruction Set Architectures) for their GPUs and changing them from generation to generation. In any case, this gains some percentage of performance due to the absence (as unnecessary) of a decoder. But AMD went further and came up with its own format for arranging instructions in machine code: they are arranged not sequentially (according to the program listing), but in sections.

First comes a section of conditional branch instructions, which contain links to the sections of continuous arithmetic instructions corresponding to the different branch targets. The latter are called VLIW bundles. These sections contain only arithmetic instructions with data from registers or local memory. Such an organization simplifies managing the flow of instructions and delivering them to the execution units, which is all the more useful given that VLIW instructions are relatively large. There are also sections for memory access instructions.

Conditional jump instruction sections
  Section 0: Branch 0 → link to section No. 3 of continuous arithmetic instructions
  Section 1: Branch 1 → link to section No. 4
  Section 2: Branch 2 → link to section No. 5

Continuous arithmetic instruction sections
  Section 3: VLIW instruction 0, VLIW instruction 1, VLIW instruction 2, VLIW instruction 3
  Section 4: VLIW instruction 4, VLIW instruction 5
  Section 5: VLIW instruction 6, VLIW instruction 7, VLIW instruction 8, VLIW instruction 9

GPUs from both Nvidia and AMD also have built-in instructions to quickly calculate basic math functions, square root, exponent, logarithms, sines, and cosines for single-precision numbers in a few clock cycles. There are special computing units for this. They “originate” from the need to implement fast approximation of these functions in geometry shaders.

Even if someone did not know that GPUs are used for graphics, and only read the technical characteristics, then by this sign he could guess that these computing coprocessors originated from video accelerators. Likewise, based on certain traits of marine mammals, scientists realized that their ancestors were land creatures.

But a more obvious feature that reveals the graphical origin of the device is the 2D and 3D texture reading units with support for bilinear interpolation. They are widely used in GPU programs, as they provide accelerated and simplified reading of read-only data arrays. One of the standard behaviors of a GPU application is to read arrays of source data, process them in the computing cores, and write the result to another array, which is then transferred back to the CPU. This scheme is standard and common because it is convenient for the GPU architecture. Tasks that require intensive reads and writes into one large region of global memory, thus containing data dependencies, are difficult to parallelize and implement efficiently on the GPU. Also, their performance will greatly depend on the latency of global memory, which is very high. But if the task is described by the pattern “reading data - processing - writing the result,” then you can almost certainly get a big boost from executing it on the GPU.

For texture data in the GPU, there is a separate hierarchy of small caches of the first and second levels. This is what provides acceleration from using textures. This hierarchy originally appeared in GPUs in order to take advantage of the locality of access to textures: obviously, after processing one pixel, a neighboring pixel (with a high probability) will require nearby texture data. But many algorithms for conventional calculations have a similar nature of data access. So texture caches from graphics will be very useful.

Although the size of the L1-L2 caches in Nvidia and AMD cards is approximately similar, which is obviously caused by the requirements for optimality in terms of game graphics, the access latency to these caches varies significantly. Nvidia has higher access latency, and texture caches in GeForce primarily help reduce the load on the memory bus, rather than directly speed up data access. This is not noticeable in graphics programs, but is important for general purpose programs. In Radeon, the latency of the texture cache is lower, but the latency of the local memory of miniprocessors is higher. We can give the following example: for optimal matrix multiplication on Nvidia cards, it is better to use local memory, loading the matrix there block by block, while for AMD it is better to rely on a low-latency texture cache, reading matrix elements as needed. But this is already a fairly subtle optimization, and for an algorithm that has already been fundamentally transferred to the GPU.
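
To make the "block by block" idea concrete, here is a condensed CUDA sketch of the shared (local) memory approach mentioned for Nvidia cards. The tile size and the assumption that the matrix dimension is a multiple of the tile (and that the kernel is launched with TILE×TILE thread blocks) are simplifications of this example:

#define TILE 16  // tile size chosen for the example

// C = A * B; all matrices are n x n and n is assumed to be a multiple of TILE.
// Launch configuration assumed: dim3 block(TILE, TILE), dim3 grid(n/TILE, n/TILE).
__global__ void matmulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float tileA[TILE][TILE];   // block of A in fast local (shared) memory
    __shared__ float tileB[TILE][TILE];   // block of B in fast local (shared) memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current tiles.
        tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                  // the whole tile must be loaded first

        for (int k = 0; k < TILE; ++k)    // multiply-accumulate on the tile
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                  // done with the tile, safe to overwrite it
    }
    C[row * n + col] = sum;
}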

This difference also shows up in the use of 3D textures. One of the first GPU computing benchmarks, which showed a serious advantage for AMD, used 3D textures, since it worked with a three-dimensional data array. Texture access latency in Radeon is significantly lower, and the 3D case is additionally optimized in hardware.

To obtain maximum performance from hardware from various companies, some tuning of the application for a specific card is required, but this is an order of magnitude less significant than the development of an algorithm for the GPU architecture in principle.

Radeon 47xx Series Limitations

In this family, support for GPU computing is incomplete. Three important points can be noted. First, there is no local memory; that is, it is physically present, but it lacks the universal access required by the modern standard of GPU programs. It is emulated in software in global memory, which means that using it, unlike on a full-featured GPU, brings no benefit. The second point is limited support for the various atomic memory operations and synchronization instructions. And the third point is the rather small instruction cache: beyond a certain program size, speed drops significantly. There are other minor restrictions as well. One can say that only programs ideally suited to the GPU will run well on this video card. Although in simple test programs that operate only on registers the card may show good Gigaflops figures, programming anything complex for it effectively is problematic.

Advantages and disadvantages of Evergreen

If you compare AMD and Nvidia products from a GPU computing perspective, the 5xxx series looks like a greatly strengthened GT200. So strong that it surpasses Fermi in peak performance by about two and a half times, especially after the parameters of the new Nvidia video cards were cut down and the number of cores reduced. But the L2 cache introduced in Fermi simplifies the implementation of some algorithms on the GPU, thus expanding the GPU's scope. Interestingly, for CUDA programs well optimized for the previous GT200 generation, Fermi's architectural innovations often gave nothing. They sped up in proportion to the increase in the number of compute units, that is, less than twofold (for single-precision numbers), or even less, because memory bandwidth did not increase (or for other reasons).

And in tasks that are well suited to the GPU architecture and have a pronounced vector nature (for example, matrix multiplication), Radeon shows performance relatively close to the theoretical peak and outperforms Fermi. Not to mention multi-core CPUs. Especially in problems with single precision numbers.

But Radeon has a smaller die area, lower heat dissipation and power consumption, higher yield and, accordingly, lower cost. And directly in 3D graphics tasks Fermi's advantage, if it exists at all, is much smaller than the difference in die area. This is largely because the Radeon computing architecture, with 16 compute units per miniprocessor, a wave front size of 64 threads and VLIW vector instructions, is excellent for its main task: computing graphics shaders. For the vast majority of ordinary users, gaming performance and price are the priorities.

From a professional, scientific software perspective, the Radeon architecture provides the best price-performance, performance-per-watt, and absolute performance on tasks that are inherently well-matched to GPU architectures, allowing for parallelization and vectorization.

For example, in a fully parallel, easily vectorizable task of brute-forcing keys, Radeon is several times faster than GeForce and several tens of times faster than a CPU.

This is consistent with the general concept of AMD Fusion, according to which GPUs should complement the CPU, and in the future be integrated into the CPU core itself, just as the math coprocessor was previously moved from a separate chip to the processor core (this happened twenty years ago, before the appearance of the first Pentium processors). The GPU will be an integrated graphics core and vector coprocessor for streaming tasks.

Radeon uses a clever technique of interleaving instructions from different wave fronts as they are executed by the functional units. This is easy to do, since the instructions are completely independent. The principle is similar to the pipelined execution of independent instructions in modern CPUs. Apparently, this is what makes it possible to execute the complex, multi-byte vector VLIW instructions efficiently. On a CPU this would require a sophisticated scheduler to identify independent instructions, or Hyper-Threading technology, which also supplies the CPU with instructions from different threads that are known to be independent.

cycle:        0          1          2          3          4          5          6          7
wave front:   0          1          0          1          0          1          0          1
VLIW0:     instr. 0   instr. 0   instr. 16  instr. 16  instr. 32  instr. 32  instr. 48  instr. 48
VLIW1:     instr. 1   instr. 1   instr. 17  instr. 17  instr. 33  instr. 33  instr. 49  instr. 49
...
VLIW15:    instr. 15  instr. 15  instr. 31  instr. 31  instr. 47  instr. 47  instr. 63  instr. 63

128 instructions of two wave fronts, each consisting of 64 operations, are executed by 16 VLIW units in eight clock cycles. The instructions interleave, and each unit in reality has two clock cycles to execute a whole instruction, provided that on the second cycle it starts a new one in parallel. This probably helps to execute quickly a VLIW instruction like a1×a2+b1×b2+c1×c2+d1×d2, that is, to execute eight such instructions in eight clock cycles. (Formally, it comes out to one per clock cycle.)

Nvidia apparently has no such technique. And in the absence of VLIW, high performance with scalar instructions requires a high clock frequency, which automatically increases heat dissipation and places high demands on the manufacturing process (to force the circuit to run at a higher frequency).

The disadvantage of Radeon from the point of view of GPU computing is its strong dislike of branching. GPUs in general do not favor branching, because of the technique described above for executing instructions: at once, for a whole group of threads with one program address. (By the way, this technique is called SIMT: Single Instruction - Multiple Threads, by analogy with SIMD, where one instruction performs one operation on different data.) However, Radeon dislikes branching particularly: this is caused by the larger size of its thread bundle. Clearly, if the program is not completely vector, then the larger the warp or wave front, the worse, since when neighboring threads diverge in their program paths, more groups are formed that must be executed sequentially (serialized). Suppose all the threads diverge completely: then with a warp size of 32 the program runs 32 times slower, and with a size of 64, as on Radeon, 64 times slower.

This is a noticeable, but not the only, manifestation of this hostility. In Nvidia video cards each functional unit, otherwise called a CUDA core, has a dedicated branch processing unit. In Radeon video cards, a miniprocessor with 16 compute units has only two branch control units (they are separated from the domain of the arithmetic units). So even simple processing of a conditional jump instruction, even when its outcome is the same for all threads in the wave front, takes additional time, and speed suffers.

AMD also produces CPUs. They believe that for programs with a large number of branches, the CPU is still better suited, while the GPU is intended for pure vector programs.

So Radeon gives lower overall programming efficiency, but better price/performance in many cases. In other words, there are fewer programs that can be efficiently (profitably) migrated from a CPU to a Radeon than there are programs that run efficiently on Fermi. But those that can be transferred effectively will in many cases run more efficiently on Radeon.

API for GPU computing

The technical specifications of Radeon themselves look attractive, although there is no need to idealize and absolutize GPU computing. But no less important for performance is the software needed to develop and execute a GPU program: compilers from a high-level language and the runtime, that is, the driver that mediates between the part of the program running on the CPU and the GPU itself. It is even more important than in the case of a CPU: the CPU needs no driver to manage data transfers, and from the compiler's point of view the GPU is more finicky. For example, the compiler must make do with a minimal number of registers for storing intermediate results, and carefully inline function calls, again using a minimum of registers. After all, the fewer registers a thread uses, the more threads can be launched and the more fully the GPU is loaded, hiding memory access time better.

And software support for Radeon products still lags behind the hardware. (Unlike the situation with Nvidia, where the hardware was delayed and the product came out in a stripped-down form.) Until recently, the OpenCL compiler produced by AMD had beta status and many flaws. Too often it generated erroneous code, or refused to compile correct source code, or itself crashed with an error. Only at the end of spring was a release with high performance published. It is not free of errors either, but there are significantly fewer of them, and they tend to arise in corner cases, when trying to program something on the verge of correctness. For example, take the uchar4 type, which defines a four-component 4-byte variable. This type is in the OpenCL specification, but it is not worth using on Radeon, because the registers are 128-bit: the same four components, but 32-bit each. Such a uchar4 variable will still occupy a whole register and will only require additional packing operations and access to the individual byte components. A compiler should not have any errors, but there are no compilers without flaws: even the Intel Compiler after 11 versions has compilation bugs. The identified errors are corrected in the next release, which will come out closer to the fall.

But there are still many things that need improvement. For example, the standard Radeon GPU driver still does not support GPU computing using OpenCL. The user must download and install an additional special package.

But the most important thing is the absence of any function libraries. For double precision real numbers there is not even a sine, cosine or exponent. Well, this is not required for matrix addition and multiplication, but if you want to program something more complex, you need to write all the functions from scratch. Or wait for a new SDK release. ACML (AMD Core Math Library) for the Evergreen GPU family with support for basic matrix functions should be released soon.

At the moment, in the author's opinion, the Direct Compute 5.0 API looks feasible for programming Radeon video cards, naturally taking its limitations into account: it targets the Windows 7 and Windows Vista platforms. Microsoft has extensive experience in building compilers, and we can expect a fully functional release very soon; Microsoft is directly interested in this. But Direct Compute is focused on the needs of interactive applications: compute something and immediately visualize the result, for example, the flow of a liquid over a surface. This does not mean it cannot be used purely for computation, but that is not its natural purpose. Granted, Microsoft does not plan to add library functions to Direct Compute, precisely the ones AMD currently lacks. That is, what can now be computed effectively on Radeon, some not very sophisticated programs, can also be implemented on Direct Compute, which is much simpler than OpenCL and should be more stable. Plus, it is completely portable and will run on both Nvidia and AMD, so you only have to compile the program once, whereas the OpenCL SDK implementations from Nvidia and AMD are not entirely compatible. (In the sense that if you develop an OpenCL program on an AMD system using the AMD OpenCL SDK, it may not run as easily on Nvidia; you may need to recompile the same source with the Nvidia SDK. And, of course, vice versa.)

Then, there is a lot of redundant functionality in OpenCL, since OpenCL is intended as a universal programming language and API for a wide range of systems: GPU, CPU, Cell. So if you just need to write a program for a typical user system (a processor plus a video card), OpenCL does not seem particularly "productive", so to speak. Each function has ten parameters, nine of which must be set to 0, and to set each parameter you have to call a special function, which also has parameters.

And the most important current advantage of Direct Compute is that the user does not need to install a special package: everything that is needed is already in DirectX 11.

Problems of GPU computing development

If we take the sphere of personal computers, the situation is this: there are not many tasks that require large computing power and that an ordinary dual-core processor cannot handle. It is as if large, voracious but clumsy beasts had crawled out of the sea onto land, where there is almost nothing to eat. And the native inhabitants of the land are shrinking in size, learning to consume less, as always happens when natural resources are scarce. If there were the same need for performance today as 10-15 years ago, GPU computing would be a big hit. As it is, the problems of compatibility and the relative complexity of GPU programming come to the fore. It is better to write a program that runs on all systems than one that runs fast but only on the GPU.

The prospects for GPUs are somewhat better in terms of use in professional applications and the workstation sector, since there is a greater need for performance there. There are plugins for 3D editors with GPU support: for example, for rendering using ray tracing - not to be confused with regular GPU rendering! Something is also emerging for 2D and presentation editors, with faster creation of complex effects. Video processing programs are also gradually gaining GPU support. The above tasks, due to their parallel nature, fit well with the GPU architecture, but now a very large code base has been created, debugged and optimized for all the capabilities of the CPU, so it will take time for good GPU implementations to appear.

In this segment, GPU weaknesses also show up, such as the limited amount of video memory, approximately 1 GB for typical cards. One of the main factors reducing the performance of GPU programs is the need to exchange data between the CPU and GPU over a slow bus, and because of the limited memory more data has to be transferred. Here AMD's concept of combining GPU and CPU in one module looks promising: one can sacrifice the high bandwidth of graphics memory for easy, low-latency access to shared memory. The high bandwidth of current GDDR5 video memory is far more in demand by graphics programs proper than by most GPU computing programs. In general, shared memory between GPU and CPU will simply significantly expand the scope of the GPU, making it possible to use its computing capabilities in small subtasks of programs.

And GPUs are most in demand in the field of scientific computing. Several GPU-based supercomputers have already been built, which show very high results in the matrix operations test. Scientific problems are so diverse and numerous that there are always many that fit perfectly into the GPU architecture, for which the use of a GPU makes it easy to obtain high performance.

If one task had to be singled out among all the tasks of modern computers, it would be computer graphics: the image of the world in which we live. And the architecture optimal for this purpose cannot be bad. This task is so important and fundamental that hardware specially designed for it is bound to be universal enough and optimal for other tasks too. Moreover, video cards keep evolving successfully.

There can never be too many cores...

Modern GPUs are monstrous, fast beasts capable of chewing through gigabytes of data. However, people are resourceful, and no matter how much computing power grows, they come up with ever more complex problems, so the moment comes when we sadly have to admit that optimization is needed 🙁

This article describes the basic concepts, to make it easier to navigate the theory of GPU optimization, and the basic rules, so that you have to turn to that theory less often.

Reasons why GPUs are effective for working with large amounts of data that require processing:

  • they have great capabilities for parallel execution of tasks (many, many processors)
  • high memory bandwidth

Memory bandwidth is how much information (bits or gigabytes) can be transferred per unit of time (a second or a processor cycle).

One of the optimization tasks is to make maximum use of this bandwidth, that is, to increase the effective throughput (ideally it should equal the memory bandwidth).

To improve bandwidth utilization:

  • increase the amount of information per transaction and use the bandwidth to the full (for example, have each thread work with float4; see the sketch after this list)
  • reduce latency, the delay between operations
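
A minimal CUDA sketch of the float4 idea from the first item (the kernel name and sizes are illustrative): each thread loads and stores 16 bytes per transaction instead of 4.

// Each thread processes one float4, i.e. four floats per memory transaction.
// n4 is the number of float4 elements (the total float count divided by 4).
__global__ void scaleFloat4(const float4* in, float4* out, int n4, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;

    float4 v = in[i];          // one 128-bit load
    v.x *= k;  v.y *= k;  v.z *= k;  v.w *= k;
    out[i] = v;                // one 128-bit store
}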

Latency is the period of time between the moment the controller requests a specific memory cell and the moment the data becomes available to the processor for executing instructions. We cannot influence the delay itself in any way: these limitations exist at the hardware level. But it is precisely thanks to this delay that the processor can serve several threads at once: while thread A is waiting for its memory request, thread B can compute something, and thread C can wait for the data it requested to arrive.

How to reduce latency if synchronization is used:

  • reduce the number of threads in a block
  • increase the number of block groups

Full use of GPU resources – GPU Occupancy

In highbrow conversations about optimization the terms gpu occupancy or kernel occupancy often come up; they reflect the efficiency of using the video card's resources. Let me note separately that even if you use all the resources, this does not mean you are using them effectively.

The computing power of the GPU is hundreds of computationally hungry processors; when creating a program - the kernel - the burden of distributing the load falls on the shoulders of the programmer. A mistake can result in much of these precious resources sitting idle. Now I will explain why. We'll have to start from afar.

Let me remind you that a warp (warp in NVidia terminology, wavefront in AMD terminology) is a set of threads that simultaneously execute the same kernel function on the processor. Threads, grouped by the programmer into blocks, are divided into warps by the thread scheduler (separately on each multiprocessor): while one warp is executing, another waits for its memory requests to be served, and so on. If some of a warp's threads are still performing calculations while others have already done all they could, the computing resource is being used inefficiently, popularly known as idle capacity.

Each synchronization point, each branch of logic can generate such an idle situation. The maximum divergence (branching of execution logic) depends on the size of the warp. For NVidia GPUs it is 32, for AMD it is 64.

To reduce multiprocessor downtime during warp execution:

  • minimize barrier waiting time
  • minimize the divergence of execution logic in the kernel function

To effectively solve this problem, it makes sense to understand how warps are formed (for the case with several dimensions). In fact, the order is simple - first in X, then in Y and, lastly, in Z.

The kernel is launched with blocks of size 64×16; the threads are divided into warps in the order X, Y, Z, i.e. the first 64 elements form two warps, then the next 64, and so on.

The kernel is launched with blocks of size 16×64. The first and second groups of 16 elements go into the first warp, the third and fourth into the second warp, and so on.
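
A small CUDA illustration of this ordering (the kernel is contrived, only to show the index arithmetic): the linear index that determines warp membership is built from X first, then Y, then Z.

__global__ void showWarpOrder(int* warpOfThread)
{
    // Threads are linearized in the order X, then Y, then Z; warps are cut
    // from this sequence in chunks of warpSize (32 on NVidia, 64 on AMD).
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    warpOfThread[linear] = linear / warpSize;   // warpSize is a built-in device variable
}

// For a 64x16 block: dim3 block(64, 16); showWarpOrder<<<1, block>>>(buf);
//   the 64 threads of row y = 0 fill warps 0 and 1, and so on.
// For a 16x64 block: rows y = 0 and y = 1 together form warp 0, etc.
// (buf must hold one int per thread of the block.)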

How to reduce divergence (remember, branching is not always the cause of critical performance loss)

  • when adjacent threads have different execution paths, i.e. many conditions and jumps along them, look for ways to restructure the code
  • look for an unbalanced load across threads and remove it decisively (that is when not only do we have conditions, but, because of these conditions, the first thread always computes something while the fifth does not meet the condition and sits idle)

How to make the most of your GPU resources

GPU resources, unfortunately, also have their limitations. And, strictly speaking, before launching the kernel function, it makes sense to determine the limits and take these limits into account when distributing the load. Why is it important?

Video cards have restrictions on the total number of threads one multiprocessor can execute, the maximum number of threads in one block, the maximum number of warps on one processor, restrictions on the various kinds of memory, and so on. All this information can be requested either programmatically, through the corresponding API, or in advance using utilities from the SDK (the deviceQuery sample for NVidia devices, CLInfo for AMD video cards).
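
For NVidia devices, for example, the same information is available through the CUDA runtime API; a minimal sketch (error handling omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device #0

    printf("Device:                  %s\n", prop.name);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}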

General practice:

  • the number of thread blocks/workgroups must be a multiple of the number of stream processors
  • block/workgroup size must be a multiple of the warp size

It should be kept in mind that the absolute minimum is 3-4 warps/wavefronts in flight on each processor at once; wise guides advise assuming at least seven wavefronts. At the same time, do not forget the hardware restrictions!
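
A typical way to pick launch parameters with these rules in mind (the kernel and the numbers are illustrative): the block size is a multiple of the warp size, and the grid is sized to cover the whole data set.

__global__ void someKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // trivial body, just for the example
}

void launchWithReasonableConfig(const float* dIn, float* dOut, int n)
{
    const int blockSize = 256;                              // a multiple of the warp size (8 warps of 32)
    const int gridSize  = (n + blockSize - 1) / blockSize;  // enough blocks to cover all n elements
    someKernel<<<gridSize, blockSize>>>(dIn, dOut, n);
}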

Keeping all these details in your head quickly gets tedious, so for calculating gpu occupancy NVidia offered an unexpected tool: an Excel(!) calculator full of macros. There you enter the maximum number of threads per SM, the number of registers and the size of the shared memory available on the stream processor, along with the launch parameters of your functions, and it shows the efficiency of resource use as a percentage (and you tear your hair out realizing that you are short of registers to use all the cores).

Usage information:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#calculating-occupancy

GPU and memory operations

Video cards are optimized for 128-bit memory operations. That is, ideally each memory manipulation should change four four-byte values at a time. The main trouble for the programmer is that modern GPU compilers do not know how to optimize such things; this has to be done directly in the function code and, on average, it brings fractions of a percent of performance gain. The frequency of memory requests has a far greater impact on performance.

The problem is this: each request returns a piece of data that is a multiple of 128 bits in size, while each thread uses only a quarter of it (in the case of an ordinary four-byte variable). When adjacent threads simultaneously work with data located sequentially in memory, the total number of memory accesses is reduced. This is called coalesced access (good, for both reads and writes); strided access to a contiguous chunk of memory, on the other hand, is bad. With the right organization of the code, coalescing can significantly improve performance. When organizing your kernel, remember: contiguous access works along the elements of one row of memory; working with the elements of a column is no longer as efficient. Want more details? I liked this pdf - or google for "memory coalescing techniques".
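
A hedged CUDA sketch of the difference (row-major layout, illustrative names): in the first kernel neighbouring threads touch neighbouring addresses and the accesses coalesce; in the second, the threads of a warp are a whole row apart at every step, and each access turns into a separate transaction.

// Matrix of size rows x cols, stored row by row (row-major).

// Coalesced: each thread sums one column. At every loop step the threads of a
// warp read neighbouring 4-byte words, so the accesses merge into few transactions.
__global__ void columnSumsCoalesced(const float* m, float* colSum, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;
    float s = 0.0f;
    for (int r = 0; r < rows; ++r)
        s += m[r * cols + col];        // consecutive threads -> consecutive addresses
    colSum[col] = s;
}

// Strided: each thread sums one row. At every loop step the threads of a warp
// are cols * 4 bytes apart, so each access becomes a separate transaction.
__global__ void rowSumsStrided(const float* m, float* rowSum, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float s = 0.0f;
    for (int c = 0; c < cols; ++c)
        s += m[row * cols + c];
    rowSum[row] = s;
}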

The leading position in the "bottleneck" category is occupied by another memory operation: copying data from host memory to the GPU. The copy does not happen from just anywhere, but from a memory area specially allocated by the driver and the system: when there is a request to copy data, the system first copies the data there, and only then uploads it to the GPU. The speed of data transport is limited by the bandwidth of the PCI Express xN bus (where N is the number of lanes) through which modern video cards communicate with the host.

However, the extra copy through slow host memory is sometimes an unjustified overhead. The solution is to use so-called pinned memory: a specially marked memory area that the operating system is not allowed to touch (for example, dump it into swap or move it at its discretion). Data transfer from the host to the video card then proceeds without the participation of the operating system, asynchronously, via DMA (direct memory access).
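
In CUDA this looks roughly as follows (a sketch; the buffer sizes are placeholders): cudaMallocHost allocates pinned host memory, and cudaMemcpyAsync then transfers it via DMA without the intermediate system copy.

#include <cuda_runtime.h>

void pinnedTransferExample(int n)
{
    float *hostBuf, *devBuf;
    cudaMallocHost(&hostBuf, n * sizeof(float));   // pinned (page-locked) host memory
    cudaMalloc(&devBuf, n * sizeof(float));

    // ... fill hostBuf here ...

    // The copy goes over DMA straight from the pinned buffer, without the
    // intermediate system copy, and returns control to the CPU immediately.
    cudaMemcpyAsync(devBuf, hostBuf, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaDeviceSynchronize();   // wait for the transfer before using devBuf elsewhere

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
}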

And finally, a little more about memory. Shared memory on a multiprocessor is usually organized as memory banks containing 32-bit words of data. The number of banks, by good tradition, varies from one GPU generation to another: 16 or 32. If each thread accesses a separate bank for its data, everything is fine. Otherwise several read/write requests hit one bank and we get a conflict (a shared memory bank conflict). Such conflicting accesses are serialized and therefore executed sequentially rather than in parallel. If all threads access the same bank, a broadcast response is used and there is no conflict. There are several ways to deal effectively with access conflicts; descriptions of the main techniques for eliminating shared memory bank conflicts are worth looking up.
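
A classic illustration of a bank conflict and its removal by padding (a sketch assuming 32 banks, a square n×n matrix with n a multiple of 32, and 32×32 thread blocks): the columns of a 32×32 shared tile all fall into the same bank, and one extra padding column spreads them across different banks.

#define TILE 32

__global__ void transposeTile(const float* in, float* out, int n)
{
    // Without the "+ 1" all elements of one column land in the same bank and the
    // column-wise read below becomes a 32-way bank conflict; the padding column
    // shifts every row by one bank and removes the conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // row-wise write: no conflict
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;          // transposed block coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read: conflict-free thanks to padding
}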

How to make math operations even faster? Remember that:

  • double-precision calculations are expensive: fp64 is much more costly than fp32
  • constants like 3.14 in the code are interpreted as fp64 by default unless the f suffix is given explicitly (3.14f); see the sketch after this list
  • to optimize the math, it is worth checking the guides for compiler flags
  • vendors include functions in their SDKs that exploit device features to achieve performance (often at the expense of portability)
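
A small CUDA-style illustration of the first two items (a sketch; the formula is arbitrary): the f suffix and the single-precision intrinsic keep the whole computation in fp32, while an unsuffixed constant would silently drag it into fp64.

__global__ void mathPrecision(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Bad: 3.14 is a double constant, so the multiplication is promoted to fp64:
    // out[i] = in[i] * 3.14;

    // Good: the f suffix keeps the constant in fp32, and __expf is the fast
    // single-precision intrinsic (less accurate than expf, but much cheaper).
    out[i] = __expf(in[i] * 3.14f);
}
// The nvcc flag -use_fast_math replaces many standard math calls with such
// intrinsics globally (at the cost of accuracy).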

It makes sense for CUDA developers to pay close attention to the concept of a cuda stream, which allows several kernel functions to run on one device at once, or to overlap asynchronous copying of data between host and device with kernel execution. OpenCL does not yet provide such functionality :)
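
A hedged sketch of overlapping copies and computation with two CUDA streams (the kernel and sizes are placeholders; n is assumed even, and the host buffer must be pinned for the copies to be truly asynchronous):

#include <cuda_runtime.h>

__global__ void work(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;                       // placeholder computation
}

void twoStreamsExample(float* pinnedHost, float* dev, int n)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    for (int k = 0; k < 2; ++k) {
        float* h = pinnedHost + k * half;
        float* d = dev + k * half;
        // While one stream is computing on its half of the data, the other can
        // already be copying its half to or from the device.
        cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        work<<<(half + 255) / 256, 256, 0, s[k]>>>(d, half);
        cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}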

Tools for profiling:

NVidia Visual Profiler is an interesting utility that analyzes both CUDA and OpenCL kernels.

P.S. As a more extensive guide to optimization, I can recommend googling all kinds of best practices guide for OpenCL and CUDA.


Using GPU Computing with C++ AMP

So far, in discussing parallel programming techniques, we have considered only processor cores. We have gained some skills in parallelizing programs across multiple processors, synchronizing access to shared resources, and using high-speed synchronization primitives without using locks.

However, there is another way to parallelize programs: graphics processing units (GPUs), which have more cores than even high-performance processors. GPU cores are excellent for implementing parallel data processing algorithms, and their large number more than pays for the inconvenience of running programs on them. In this article we will get acquainted with one way of running programs on a GPU, using a set of C++ language extensions called C++ AMP.

The C++ AMP extensions are based on the C++ language, which is why the examples in this article are in C++. However, with moderate use of the .NET interop mechanisms, you can use C++ AMP algorithms in your .NET programs. We will talk about this at the end of the article.

Introduction to C++ AMP

Essentially, a GPU is the same processor as any other, but with a special set of instructions, a large number of cores and its own memory access protocol. However, there are big differences between modern GPUs and conventional processors, and understanding them is key to creating programs that effectively use the processing power of the GPU.

    Modern GPUs have a very small instruction set. This implies some limitations: lack of ability to call functions, limited set of supported data types, lack of library functions, and others. Some operations, such as conditional branches, can cost significantly more than similar operations performed on conventional processors. Obviously, moving large amounts of code from the CPU to the GPU under such conditions requires significant effort.

    The number of cores in the average GPU is significantly higher than in the average conventional processor. However, some tasks are too small or cannot be broken down into large enough parts to benefit from the GPU.

    Synchronization support between GPU cores performing the same task is very poor, and completely absent between GPU cores performing different tasks. This circumstance requires synchronization of the graphics processor with a conventional processor.

The question immediately arises: what tasks are suitable for solving on a GPU? Keep in mind that not every algorithm is suitable for execution on a GPU. For example, GPUs don't have access to I/O devices, so you won't be able to improve the performance of a program that scrapes RSS feeds from the Internet by using a GPU. However, many computational algorithms can be transferred to the GPU and can be massively parallelized. Below are a few examples of such algorithms (this list is by no means complete):

    increasing and decreasing sharpness of images, and other transformations;

    fast Fourier transform;

    matrix transposition and multiplication;

    number sorting;

    direct hash inversion.

An excellent source for additional examples is the Microsoft Native Concurrency blog, which provides code snippets and explanations for various algorithms implemented in C++ AMP.

C++ AMP is a framework included with Visual Studio 2012 that gives C++ developers an easy way to perform computations on the GPU, requiring only a DirectX 11 driver. Microsoft has released C++ AMP as an open specification that can be implemented by any compiler vendor.

The C++ AMP framework allows you to run code on graphics accelerators, which are treated as computing devices. Using the DirectX 11 driver, C++ AMP dynamically discovers all accelerators. It also includes a software emulator accelerator and a CPU-based accelerator, WARP, which serves as a fallback on systems that have no GPU or have a GPU without a DirectX 11 driver, and which uses multiple cores and SIMD instructions.

Now let's start exploring an algorithm that can easily be parallelized for execution on a GPU. The implementation below takes two vectors of equal length and calculates the pointwise result. It's hard to imagine anything more straightforward:

void VectorAddExpPointwise(float* first, float* second, float* result, int length) {
  for (int i = 0; i < length; ++i) {
    result[i] = first[i] + exp(second[i]);
  }
}

To parallelize this algorithm on a regular processor, you need to split the iteration range into several subranges and run one thread of execution for each of them. We've spent a lot of time in previous articles on exactly this way of parallelizing our first prime number search example - we've seen how it can be done by creating threads manually, passing jobs to a thread pool, and using Parallel.For and PLINQ to automatically parallelize. Remember also that when parallelizing similar algorithms on a conventional processor, we took special care not to split the problem into too small tasks.

For the GPU these warnings are not needed. GPUs have many cores that execute threads very quickly, and the cost of context switching is significantly lower than on conventional processors. Below is a fragment that uses the parallel_for_each function from the C++ AMP framework:

#include <amp.h>
#include <amp_math.h>
using namespace concurrency;

void VectorAddExpPointwise(float* first, float* second, float* result, int length) {
  array_view<const float, 1> avFirst(length, first);
  array_view<const float, 1> avSecond(length, second);
  array_view<float, 1> avResult(length, result);
  avResult.discard_data();
  parallel_for_each(avResult.extent, [=](index<1> i) restrict(amp) {
    avResult[i] = avFirst[i] + fast_math::exp(avSecond[i]);
  });
  avResult.synchronize();
}

Now let's examine each part of the code separately. Let's immediately note that the general form of the main loop has been preserved, but the originally used for loop has been replaced by a call to the parallel_for_each function. In fact, the principle of converting a loop into a function or method call is not new to us - such a technique has previously been demonstrated using the Parallel.For() and Parallel.ForEach() methods from the TPL library.

Next, the input data (the parameters first, second and result) are wrapped in array_view instances. The array_view class is used to wrap data passed to the GPU (the accelerator). Its template parameters specify the element type and the dimensionality. To execute instructions on a GPU that access data originally held by a conventional CPU, someone or something must take care of copying that data to the GPU, because most modern graphics cards are separate devices with their own memory. array_view instances solve this problem: they copy data on demand and only when it is really needed.

When the GPU completes the task, the data is copied back. By instantiating array_view with a const element type, we ensure that first and second are copied into GPU memory but not copied back. Likewise, by calling discard_data() we avoid copying result from CPU memory to accelerator memory; it will only be copied in the opposite direction.

The parallel_for_each function takes an extent object that specifies the shape of the data to be processed, and a function to apply to each element of that extent. In the example above we used a lambda function, support for which appeared in the ISO C++ 2011 (C++11) standard. The restrict(amp) keyword instructs the compiler to check whether the function body can be executed on the GPU and disables most C++ features that cannot be compiled into GPU instructions.

The lambda's parameter, an index<1> object, represents a one-dimensional index. It must match the extent object being used: if we declared the extent to be two-dimensional (for example, by defining the shape of the source data as a two-dimensional matrix), the index would also need to be two-dimensional. An example of such a situation is given below.

Finally, the call to synchronize() at the end of the VectorAddExpPointwise method ensures that the results computed by the GPU in the array_view avResult are copied back into the result array.

This concludes our first introduction to the world of C++ AMP, and now we are ready for a more detailed look, as well as for more interesting examples demonstrating the benefits of parallel computing on a GPU. Vector addition is not the best candidate for demonstrating GPU usage, because the overhead of copying the data is large relative to the trivial amount of computation per element. The next subsection shows two more interesting examples.

Matrix multiplication

The first "real" example we'll look at is matrix multiplication. For the implementation we will take the simple cubic matrix multiplication algorithm rather than the Strassen algorithm, whose running time is slightly better than cubic, roughly O(n^2.807). Given two matrices, an m x w matrix A and a w x n matrix B, the following program multiplies them and returns the result, an m x n matrix C:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int sum = 0;
      for (int k = 0; k < w; ++k) {
        sum += A[i * w + k] * B[k * n + j];
      }
      C[i * n + j] = sum;
    }
  }
}

There are several ways to parallelize this implementation, and if you want to parallelize this code to run on a regular processor, the right choice would be to parallelize the outer loop. However, the GPU has a fairly large number of cores, and by parallelizing only the outer loop, we will not be able to create a sufficient number of jobs to load all the cores with work. Therefore, it makes sense to parallelize the two outer loops, leaving the inner loop untouched:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
  array_view<const int, 2> avA(m, w, A);
  array_view<const int, 2> avB(w, n, B);
  array_view<int, 2> avC(m, n, C);
  avC.discard_data();
  parallel_for_each(avC.extent, [=](index<2> idx) restrict(amp) {
    int sum = 0;
    for (int k = 0; k < w; ++k) {
      sum += avA(idx[0], k) * avB(k, idx[1]);
    }
    avC[idx] = sum;
  });
}

This implementation still closely resembles the sequential implementation of matrix multiplication and the vector addition example given above, except that the index is now two-dimensional and is accessed in the inner loop through its components, idx[0] and idx[1]. How much faster is this version than the sequential alternative running on a regular processor? Multiplying two integer matrices of size 1024 x 1024, the sequential version on a regular CPU takes 7350 milliseconds on average, while the GPU version - hold on tight - takes 50 milliseconds: 147 times faster!

Particle motion simulation

The examples above have a very simple inner loop, and clearly this will not always be the case. The Native Concurrency blog, linked above, demonstrates an example of modeling gravitational interactions between particles. The simulation consists of an effectively endless sequence of steps; at each step, new values of the acceleration vector are computed for each particle and then their new coordinates are determined. The particle vector is what gets parallelized: with a sufficiently large number of particles (several thousand and up), you can create enough tasks to keep all the GPU cores busy.

The basis of the algorithm is the implementation of determining the result of interactions between two particles, as shown below, which can easily be transferred to the GPU:

// here float4 are vectors with four elements
// representing the particles involved in the operations
void bodybody_interaction(float4& acceleration, const float4 p1, const float4 p2) restrict(amp) {
  float4 dist = p2 - p1;   // the w component is not used
  float absDist = dist.x * dist.x + dist.y * dist.y + dist.z * dist.z;
  float invDist = 1.0f / sqrt(absDist);
  float invDistCube = invDist * invDist * invDist;
  acceleration += dist * PARTICLE_MASS * invDistCube;
}

The initial data at each modeling step is an array with the coordinates and velocities of particles, and as a result of calculations, a new array with the coordinates and velocities of particles is created:

struct particle {
  float4 position, velocity;
  // implementations of the constructor, copy constructor and
  // operator= with restrict(amp) omitted to save space
};

void simulation_step(const array<particle, 1>& previous, array<particle, 1>& next, int bodies) {
  extent<1> ext(bodies);
  parallel_for_each(ext, [&](index<1> idx) restrict(amp) {
    particle p = previous[idx];
    float4 acceleration(0, 0, 0, 0);
    for (int body = 0; body < bodies; ++body) {
      bodybody_interaction(acceleration, p.position, previous[body].position);
    }
    p.velocity += acceleration * DELTA_TIME;
    p.position += p.velocity * DELTA_TIME;
    next[idx] = p;
  });
}

With the help of an appropriate graphical interface, modeling can be very interesting. The full example provided by the C++ AMP team can be found on the Native Concurrency blog. On my system with an Intel Core i7 processor and a Geforce GT 740M graphics card, the simulation of 10,000 particles runs at ~2.5 fps (steps per second) using the sequential version running on the regular processor, and 160 fps using the optimized version running on the GPU - a huge increase in performance.

Before we wrap up this section, there is one more important feature of the C++ AMP framework that can further improve the performance of code running on the GPU. GPUs support a programmable data cache (often called shared memory). The values stored in this cache are shared by all threads of execution within a single tile. Thanks to memory tiling, programs based on the C++ AMP framework can read data from graphics card memory into the tile's shared memory and then access it from multiple threads of execution without re-fetching it from graphics card memory. Accessing tile shared memory is approximately 10 times faster than accessing graphics card memory. In other words, you have reasons to keep reading.

To produce a tiled version of a parallel loop, the parallel_for_each method is passed a tiled_extent domain, which divides the multidimensional extent object into multidimensional tiles, and the lambda takes a tiled_index parameter, which specifies the global and local IDs of the thread within its tile. For example, a 16 x 16 matrix can be divided into 2 x 2 tiles (as shown in the image below) and then passed to parallel_for_each:

extent<2> matrix(16, 16);
tiled_extent<2, 2> tiledMatrix = matrix.tile<2, 2>();
parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
  // ...
});

Each of the four threads of execution belonging to the same tile can share the data stored for that tile.

When performing operations with matrices in the GPU kernel, instead of the standard index<2> used in the examples above, you can use idx.global. Proper use of local tiled memory and local indices can provide significant performance gains. To declare tiled memory shared by all threads of execution within a single tile, local variables are declared with the tile_static specifier.

In practice, a common technique is to declare shared memory and have different threads of execution initialize its individual elements:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
  // tile_static memory is shared by all threads in the tile
  tile_static int local[2][2];
  // assign a value to the element belonging to this thread of execution
  local[idx.local[0]][idx.local[1]] = 42;
});

Obviously, any benefit from shared memory can only be obtained if access to this memory is synchronized; that is, threads must not access the memory until it has been initialized by one of them. Synchronization of the threads in a tile is performed using tile_barrier objects (reminiscent of the Barrier class from the TPL library): threads can continue execution only after a call to the tile_barrier::wait() method, which returns control only once all threads in the tile have called wait(). For example:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
  // tile_static memory is shared by all threads in the tile
  tile_static int local[2][2];
  // assign a value to the element belonging to this thread of execution
  local[idx.local[0]][idx.local[1]] = 42;
  // idx.barrier is an instance of tile_barrier
  idx.barrier.wait();
  // now this thread can access the "local" array
  // using the indices of other threads of execution!
});

Now is the time to put what we have learned into a concrete example. Let's return to the matrix multiplication implemented without tiled memory organization and add the optimization described above to it. Let's assume that the matrix dimensions are a multiple of 16, which lets us work with 16 x 16 tiles. The nature of matrices allows block-by-block multiplication, and we can take advantage of this (in fact, dividing matrices into blocks is a typical optimization of the matrix multiplication algorithm that also gives more efficient CPU cache usage).
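Written out explicitly (a sketch of the identity the tiled code below relies on, using the 16-element block size assumed in this subsection):

C_{i,j} = \sum_{k=0}^{w-1} A_{i,k} B_{k,j} = \sum_{b=0}^{w/16 - 1} \left( \sum_{t=0}^{15} A_{i,\,16b+t} \, B_{16b+t,\,j} \right)

Each inner sum touches only a 16-element slice of row i and a 16-element slice of column j - exactly the data held by one pair of 16 x 16 tiles.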

The essence of this technique comes down to the following. To find C i,j (the element in row i and column j of the result matrix), you need to compute the dot product of A i,* (the i-th row of the first matrix) and B *,j (the j-th column of the second matrix). However, as the formula above shows, this is equivalent to computing partial dot products over consecutive segments of the row and column and then summing the results. We can use this fact to convert the matrix multiplication algorithm into a tiled version:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
  array_view<const int, 2> avA(m, w, A);
  array_view<const int, 2> avB(w, n, B);
  array_view<int, 2> avC(m, n, C);
  avC.discard_data();
  parallel_for_each(avC.extent.tile<16, 16>(), [=](tiled_index<16, 16> idx) restrict(amp) {
    int sum = 0;
    int localRow = idx.local[0], localCol = idx.local[1];
    for (int k = 0; k < w; k += 16) {
      // local 16 x 16 copies of fragments of matrices A and B
      tile_static int localA[16][16], localB[16][16];
      localA[localRow][localCol] = avA(idx.global[0], k + localCol);
      localB[localRow][localCol] = avB(k + localRow, idx.global[1]);
      idx.barrier.wait();                 // wait until both tiles are loaded
      for (int t = 0; t < 16; ++t) {
        sum += localA[localRow][t] * localB[t][localCol];
      }
      idx.barrier.wait();                 // do not overwrite tiles still in use
    }
    avC[idx.global] = sum;
  });
}

The essence of the described optimization is that each thread in the tile (256 threads are created for a 16 x 16 tile) initializes its own element of the 16 x 16 local copies of fragments of the original matrices A and B. Each thread in the tile needs only one row and one column of these blocks, but all threads together access each row and each column 16 times. This approach dramatically reduces the number of accesses to main memory.

To calculate element (i,j) of the result matrix, the algorithm requires the complete i-th row of the first matrix and the j-th column of the second matrix. With the 16 x 16 tiling shown in the diagram and k = 0, the shaded regions of the first and second matrices are read into shared memory. The thread computing element (i,j) of the result matrix then computes the partial dot product of the first 16 elements of the i-th row and j-th column of the original matrices.
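A rough count (a sketch that ignores hardware caching, which only helps further) shows where the gain comes from. The naive kernel issues 2w global-memory reads per output element - the whole row plus the whole column - while the tiled kernel issues one read into localA and one into localB per 16-element step:

\text{naive: } 2w \quad \text{vs.} \quad \text{tiled: } 2 \cdot \frac{w}{16} \;\;\Rightarrow\;\; 16\times \text{ fewer global reads,}

with the remaining reads served from tile_static memory.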

In this example, using a tiled organization provides a huge performance boost. The tiled version of matrix multiplication is much faster than the simple version, taking approximately 17 milliseconds (for the same 1024 x 1024 input matrices), which is 430 times faster than the version running on a conventional processor!

Before we end our discussion of the C++ AMP framework, we would like to mention the tools available to developers in Visual Studio. Visual Studio 2012 offers a GPU debugger that lets you set breakpoints, examine the call stack, and read and change local variable values (some accelerators support GPU debugging directly; for others, Visual Studio uses a software simulator), and a profiler that lets you evaluate the benefit an application receives from parallelizing operations on a GPU. For more information about the debugging capabilities in Visual Studio, see the article "Walkthrough: Debugging a C++ AMP Application" on MSDN.

GPU Computing Alternatives in .NET

So far this article has shown examples only in C++, but there are several ways to harness the power of the GPU from managed applications. One way is to use interop tools that let you offload GPU kernel work to low-level C++ components. This solution is great for those who want to use the C++ AMP framework or who can reuse prebuilt C++ AMP components in managed applications.

Another way is to use a library that works with the GPU directly from managed code. Several such libraries currently exist, for example GPU.NET and CUDAfy.NET (both commercial offerings). Below is an example from the GPU.NET GitHub repository demonstrating a kernel that performs an element-wise multiply-add over vectors:

public static void MultiplyAddGpu(double[] a, double[] b, double[] c)
{
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;
    for (int ElementIdx = ThreadId; ElementIdx < a.Length; ElementIdx += TotalThreads)
    {
        a[ElementIdx] += b[ElementIdx] * c[ElementIdx];
    }
}

I am of the opinion that it is much easier and more productive to learn a language extension (such as C++ AMP) than to try to orchestrate interactions at the library level or to perform significant manipulations on IL code.

Having looked at the possibilities of parallel programming in .NET and on the GPU, no one can doubt that organizing parallel computation is an important way to increase performance. In many servers and workstations around the world, the invaluable processing power of CPUs and GPUs goes unused because applications simply don't exploit it.

The Task Parallel Library gives us a unique opportunity to harness all available CPU cores, although this requires solving some interesting problems: synchronization, excessive task fragmentation, and unequal distribution of work between threads of execution.

The C++ AMP framework and other multi-purpose GPU parallel computing libraries can be successfully used to parallelize calculations across hundreds of GPU cores. Finally, there is a previously unexplored opportunity to gain productivity gains from the use of cloud distributed computing technologies, which have recently become one of the main directions in the development of information technology.

GPU Computing

CUDA (Compute Unified Device Architecture) is a software and hardware architecture that enables computing on NVIDIA graphics processors supporting GPGPU (general-purpose computing on graphics cards). The CUDA architecture first appeared on the market with the release of the eighth-generation NVIDIA chip, the G80, and is present in all subsequent series of graphics chips used in the GeForce, ION, Quadro and Tesla accelerator families.

The CUDA SDK lets programmers implement, in a special simplified dialect of the C programming language, algorithms that run on NVIDIA GPUs, and include special functions in the text of a C program. CUDA gives the developer the ability, at their own discretion, to access the instruction set of the graphics accelerator, manage its memory, and organize complex parallel computations on it.
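To give a flavor of what that looks like (a minimal illustrative sketch - the names addVectors and addOnGpu are not from the SDK samples), a kernel is an ordinary C function marked with the __global__ qualifier, and the host launches it over a grid of thread blocks:

#include <cuda_runtime.h>

// Kernel: each GPU thread processes one element of the vectors.
__global__ void addVectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: launch enough 256-thread blocks to cover all n elements
// (the pointers are assumed to already point into GPU memory).
void addOnGpu(const float* devA, const float* devB, float* devC, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addVectors<<<blocks, threadsPerBlock>>>(devA, devB, devC, n);
    cudaDeviceSynchronize();                         // wait for the kernel to finish
}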

History

In 2003, Intel and AMD were racing each other to deliver the most powerful processor. Over several years of this race, clock speeds rose significantly, especially after the release of the Intel Pentium 4.

But after that rapid growth in clock frequencies (between 2001 and 2003 the clock frequency of the Pentium 4 doubled from 1.5 to 3 GHz), users had to be content with the tenths of a gigahertz that manufacturers managed to bring to market (from 2003 to 2005 clock frequencies rose only from 3 to 3.8 GHz).

Architectures optimized for high clock frequencies, such as Prescott, also began to experience difficulties, and not only in production. Chip makers ran up against the laws of physics. Some analysts even predicted that Moore's Law would cease to apply. It did not, but its original meaning is often distorted: it concerns the number of transistors on the surface of the silicon die. For a long time, the growth in the number of transistors in a CPU was accompanied by a corresponding increase in performance, which led to the distortion. But then the situation became more complicated: CPU architects ran into the law of diminishing returns - the number of transistors that had to be added for a given increase in performance grew larger and larger, leading to a dead end.

The reason why GPU manufacturers have not run into this problem is simple: CPUs are designed to extract maximum performance from a single stream of instructions that processes different data (both integers and floating-point numbers), performs random memory accesses, and so on. To this day, designers try to extract more instruction-level parallelism - that is, to execute as many instructions as possible in parallel. For example, the Pentium introduced superscalar execution, where under certain conditions two instructions could be executed per clock cycle; the Pentium Pro added out-of-order execution, which made it possible to keep the computing units busier. The problem is that there are obvious limits to executing a sequential stream of instructions in parallel, so blindly increasing the number of computational units provides no benefit - they will still sit idle most of the time.

The operation of the GPU is relatively simple: it takes a group of polygons on one side and produces a group of pixels on the other. Polygons and pixels are independent of each other, so they can be processed in parallel. Thus, in a GPU a large part of the die can be devoted to computational units which, unlike in a CPU, will actually be used.

The GPU differs from the CPU in another way as well. Memory access in the GPU is highly predictable: if a texel is read, then a few clock cycles later the neighboring texel will be read; when a pixel is written, a few clock cycles later the neighboring one will be written. By organizing memory intelligently, you can get performance close to the theoretical bandwidth. This means that the GPU, unlike the CPU, does not need a huge cache, since its role is only to speed up texturing operations; a few kilobytes holding the few texels used in bilinear and trilinear filters are all that is needed.

First calculations on GPU

The earliest attempts at such applications were limited to the use of certain hardware functions, such as rasterization and Z-buffering. But in the current century, with the advent of shaders, matrix calculations began to be accelerated. In 2003, at SIGGRAPH, a separate section was allocated for GPU computing, and it was called GPGPU (General-Purpose computation on GPU).

The best known is BrookGPU, a compiler for the Brook streaming programming language, designed for non-graphical computation on the GPU. Before its appearance, developers using the capabilities of video chips for calculations had to choose one of two common APIs: Direct3D or OpenGL. This seriously limited the use of GPUs, because 3D graphics deal with shaders and textures, which parallel-programming specialists are not expected to know about - they work with threads and cores. Brook helped make their task easier. These streaming extensions to the C language, developed at Stanford University, hid the 3D API from programmers and presented the video chip as a parallel coprocessor. The compiler processed a .br file containing C code with extensions and produced code linked to a library with a DirectX, OpenGL, or x86 back end.

The appearance of Brook aroused the interest of NVIDIA and ATI and subsequently opened up an entirely new sector - parallel computing based on video chips.

Subsequently, some researchers from the Brook project joined the NVIDIA development team to introduce a combined hardware and software parallel computing strategy and open up a new market segment. The main advantage of this initiative is that NVIDIA's developers know all the capabilities of their GPUs down to the last detail, so there is no need to go through the graphics API - the hardware can be driven directly through the driver. The result of this team's efforts was NVIDIA CUDA.

Areas of application of parallel calculations on GPU

When transferring calculations to the GPU, many tasks achieve acceleration of 5-30 times compared to fast universal processors. The largest numbers (on the order of 100x speedup or even more!) are achieved with code that is not very suitable for calculations using SSE blocks, but is quite convenient for GPUs.

These are just some examples of speedups for synthetic code on the GPU versus SSE-vectorized code on the CPU (according to NVIDIA):

Fluorescence microscopy: 12x;

Molecular dynamics (non-bonded force calculation): 8-16x;

Electrostatics (direct and multilevel Coulomb summation): 40-120x and 7x.

A table that NVIDIA displays in all presentations shows the speed of GPUs relative to CPUs.

List of the main applications in which GPU computing is used: analysis and processing of images and signals, physics simulation, computational mathematics, computational biology, financial calculations, databases, gas and fluid dynamics, cryptography, adaptive radiation therapy, astronomy, audio processing, bioinformatics, biological simulations, computer vision, data mining, digital cinema and television, electromagnetic simulations, geographic information systems, military applications, mine planning, molecular dynamics, magnetic resonance imaging (MRI), neural networks, oceanographic research, particle physics, protein folding simulation, quantum chemistry, ray tracing, visualization, radar, reservoir simulation, artificial intelligence, satellite data analysis, seismic exploration, surgery, ultrasound, video conferencing.

Advantages and Limitations of CUDA

From a programmer's perspective, a graphics pipeline is a collection of processing stages. The geometry block generates the triangles, and the rasterization block generates the pixels displayed on the monitor. The traditional GPGPU programming model looks like this:

To transfer calculations to the GPU within this model, a special approach is needed. Even element-wise addition of two vectors requires drawing a figure to the screen or to an off-screen buffer. The figure is rasterized, and the color of each pixel is computed by a small program (a pixel shader). The program reads the input data from textures for each pixel, adds it, and writes the result to the output buffer. And all of these numerous operations are needed for something that is written as a single operator in an ordinary programming language!

Therefore, classic GPGPU for general-purpose computing has the drawback of a steep learning curve for developers. And there are enough other restrictions: a pixel shader is just a formula expressing the final color of a pixel as a function of its coordinates, and the pixel shader language is a language for writing these formulas with a C-like syntax. Early GPGPU methods are a clever trick that lets you harness the power of the GPU, but without any of the convenience: data is represented by images (textures), and the algorithm by the rasterization process. Of particular note are the very specific memory and execution models.

NVIDIA's software and hardware architecture for GPU computing differs from previous GPGPU models in that it lets you write programs for the GPU in a real C language with standard syntax and pointers, needing only a minimum of extensions to access the computing resources of video chips. CUDA does not depend on graphics APIs and has features designed specifically for general-purpose computing.

Advantages of CUDA over the traditional approach to GPGPU computing

CUDA provides access to 16 KB of thread-shared memory per multiprocessor, which can be used to organize a cache with higher bandwidth than texture fetches;

More efficient data transfer between system and video memory;

No need for graphical APIs with redundancy and overhead;

Linear memory addressing, gather and scatter, ability to write to arbitrary addresses;

Hardware support for integer and bit operations.

Main limitations of CUDA:

Lack of recursion support for executable functions;

Minimum block width is 32 threads;

Closed CUDA architecture owned by NVIDIA.

The weaknesses of programming with earlier GPGPU methods are that, on older non-unified architectures, they do not use the vertex shader execution units; data is stored in textures and written to an off-screen buffer; and multi-pass algorithms use only the pixel shader units. The limitations of GPGPU include insufficient use of the hardware's capabilities, memory bandwidth constraints, lack of a scatter operation (gather only), and the mandatory use of a graphics API.

The main advantages of CUDA over previous GPGPU methods stem from the fact that the architecture is designed for efficient non-graphics computing on the GPU and uses the C programming language, without requiring algorithms to be recast into a form suited to the graphics pipeline. CUDA offers a new path to GPU computing that does not use graphics APIs and provides arbitrary memory access (scatter as well as gather). The architecture is free of the disadvantages of GPGPU, uses all of the execution units, and also expands capabilities with integer arithmetic and bit shift operations.

CUDA opens up hardware capabilities not available from graphics APIs, such as shared memory. This is a small memory (16 kilobytes per multiprocessor) to which thread blocks have access. It lets you cache the most frequently used data and can deliver higher speeds than using texture fetches for the same task, which in turn reduces the memory-bandwidth sensitivity of parallel algorithms in many applications. For example, this is useful for linear algebra, the fast Fourier transform, and image-processing filters.
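As an illustration of this pattern (a sketch with a hypothetical kernel, not taken from any SDK sample), each thread block stages its slice of the input in shared memory once, and all of its threads then reuse that on-chip copy - here for a simple 1-D averaging filter:

#include <cuda_runtime.h>

#define BLOCK 256
#define RADIUS 3

// 1-D averaging filter: each block copies its slice of the input (plus a halo
// of RADIUS elements on each side) into shared memory, then every thread reads
// its neighbors from that fast on-chip copy instead of from video memory.
// Out-of-range elements are treated as zero for simplicity.
__global__ void boxFilter(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + RADIUS;                   // index inside the tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {                       // load the halo regions
        int left = gid - RADIUS;
        int right = gid + BLOCK;
        tile[lid - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        tile[lid + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                                  // wait until the tile is filled

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];
        out[gid] = sum / (2 * RADIUS + 1);
    }
}
// launch with BLOCK threads per block, e.g. boxFilter<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);

Each input element is fetched from video memory once per block but read up to 2 * RADIUS + 1 times from shared memory, which is exactly the kind of reuse described above.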

Memory access is also more convenient in CUDA. Code using a graphics API writes data as 32 single-precision floating-point values (RGBA values into eight render targets simultaneously) into predefined areas, whereas CUDA supports scatter writing - an unlimited number of writes to arbitrary addresses. Such advantages make it possible to run some algorithms on the GPU that cannot be implemented efficiently with GPGPU methods based on graphics APIs.

Also, graphics APIs necessarily store data in textures, which requires preliminary packaging of large arrays into textures, which complicates the algorithm and forces the use of special addressing. And CUDA allows you to read data at any address. Another advantage of CUDA is the optimized data exchange between the CPU and GPU. And for developers who want low-level access (for example, when writing another programming language), CUDA offers low-level assembly language programming capabilities.

Disadvantages of CUDA

One of the few disadvantages of CUDA is its poor portability. The architecture works only on NVIDIA video chips - and not on all of them, but starting with the GeForce 8 and 9 series and the corresponding Quadro, ION and Tesla products. NVIDIA cites a figure of 90 million CUDA-compatible video chips.

CUDA Alternatives

OpenCL

OpenCL (Open Computing Language) is a framework for writing programs that perform parallel computation on various graphics and central processors. The OpenCL framework includes a programming language based on the C99 standard and an application programming interface (API). OpenCL provides instruction-level and data-level parallelism and is an implementation of the GPGPU technique. OpenCL is a completely open standard and is royalty-free.

The goal of OpenCL is to complement OpenGL and OpenAL, which are open industry standards for 3D computer graphics and audio, by taking advantage of the power of the GPU. OpenCL is developed and maintained by the non-profit consortium Khronos Group, which includes many large companies, including Apple, AMD, Intel, nVidia, Sun Microsystems, Sony Computer Entertainment and others.

CAL/IL (Compute Abstraction Layer / Intermediate Language)

ATI Stream Technology is a set of hardware and software technologies that enable AMD GPUs to be used in conjunction with a CPU to accelerate many applications (not just graphics).

Applications for ATI Stream include computationally intensive applications such as financial analysis or seismic data processing. The use of a stream processor made it possible to increase the speed of some financial calculations by 55 times compared to solving the same problem using only the central processor.

NVIDIA does not consider ATI Stream a very strong competitor. CUDA and Stream are two different technologies at different stages of development. Programming for ATI products is much more complex - their language resembles assembly. CUDA C, in turn, is a much higher-level language; writing in it is more convenient and easier, which matters a lot for large development companies. As for performance, the peak figures of ATI products are higher than those of NVIDIA solutions - but again, it all comes down to how that power can be harnessed.

DirectX11 (DirectCompute)

An application programming interface that is part of DirectX, a set of APIs from Microsoft that is designed to run on IBM PC-compatible computers running Microsoft Windows operating systems. DirectCompute is designed to perform general-purpose computing on GPUs, an implementation of the GPGPU concept. DirectCompute was originally published as part of DirectX 11, but later became available for DirectX 10 and DirectX 10.1.

NVIDIA CUDA in the Russian scientific community

As of December 2009, the CUDA programming model was taught in 269 universities around the world. In Russia, training courses on CUDA are given at Moscow, St. Petersburg, Kazan, Novosibirsk and Perm State Universities, the "Dubna" International University of Nature, Society and Man, the Joint Institute for Nuclear Research, the Moscow Institute of Electronic Technology, Ivanovo State Power Engineering University, Shukhov Belgorod State Technological University, Bauman Moscow State Technical University, Mendeleev Russian University of Chemical Technology, the Kurchatov Institute Russian Research Center, the Interregional Supercomputer Center of the Russian Academy of Sciences, and the Taganrog Technological Institute (TTI SFU).

Speaking about parallel computing on the GPU, we should remember the times we live in: today everything is accelerating so much that we lose track of time, not noticing how it rushes by. Everything we do is tied to high accuracy and speed of information processing, and in such conditions we need tools to process all the information we have and turn it into usable data. Tasks like these matter not only to large organizations and mega-corporations, but also to ordinary users who solve high-tech problems at home on personal computers! The emergence of NVIDIA CUDA was not surprising but rather natural, because sooner or later the PC will have to handle far more demanding workloads than before. Work that previously took a long time will now take a matter of minutes, and this will affect the overall picture of the whole world!

What is GPU computing?

GPU computing is the use of the GPU to solve technical, scientific and everyday tasks. GPU computing involves using the CPU and GPU together in a heterogeneous fashion: the sequential part of the program runs on the CPU, while computationally intensive tasks are handed to the GPU. This parallelizes the work, speeds up information processing and reduces execution time; the system becomes more productive and can process more tasks simultaneously than before. However, hardware support alone is not enough to achieve this: software support is also required, so that the application can transfer the most time-consuming calculations to the GPU.

What is CUDA

CUDA is a technology for programming, in a simplified dialect of the C language, algorithms that execute on the graphics processors of eighth-generation and newer GeForce accelerators, as well as the corresponding Quadro and Tesla cards from NVIDIA. CUDA lets you include special functions in the text of a C program; these functions are written in the simplified C dialect and executed on the GPU. The initial version of the CUDA SDK was released on February 15, 2007. To translate code in this language, the CUDA SDK includes NVIDIA's own nvcc command-line C compiler. nvcc is based on the open Open64 compiler and translates host code (the main, controlling code) and device code (the hardware code) in files with the .cu extension into object files suitable for building the final program or library in any development environment, such as Microsoft Visual Studio.
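To make the host/device split concrete, here is a minimal sketch of a .cu file (the file name, kernel name and sizes are illustrative); nvcc compiles the device part to GPU code and hands the host part to the regular C++ compiler:

// example.cu  -- build with, for instance:  nvcc -o example example.cu
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scaleKernel(float* data, float factor, int n) {    // device code
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {                                                        // host code
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float* host = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev = NULL;
    cudaMalloc((void**)&dev, bytes);                       // allocate video memory
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU

    scaleKernel<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);   // run on the GPU
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

    printf("host[0] = %f\n", host[0]);                     // prints 2.000000
    cudaFree(dev);
    free(host);
    return 0;
}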

Technology capabilities

  1. A standard C language for parallel application development on GPUs.
  2. Ready-made numerical analysis libraries for fast Fourier transform and basic linear algebra software package.
  3. Special CUDA driver for computing with fast data transfer between GPU and CPU.
  4. Ability to interface the CUDA driver with OpenGL and DirectX graphics drivers.
  5. Support for Linux 32/64-bit, Windows XP 32/64-bit and MacOS operating systems.

Benefits of technology

  1. The CUDA Application Programming Interface (CUDA API) is based on the standard C programming language with some limitations. This simplifies and smoothes the process of learning the CUDA architecture.
  2. The 16 KB shared memory between threads can be used for a user-organized cache with a wider bandwidth than when fetching from regular textures.
  3. More efficient transactions between CPU memory and video memory.
  4. Full hardware support for integer and bitwise operations.

Example of technology application

cRark

The most laborious part of using this program is setting it up. The program has a console interface, but thanks to the instructions that come with it, it is easy enough to use. Below are brief instructions for setting up the program. We will test the program for functionality and compare it with a similar program that does not use NVIDIA CUDA - in this case, the well-known "Advanced Archive Password Recovery".

From the downloaded cRark archive we need only three files: crark.exe, crark-hp.exe and password.def. crark.exe is a console utility for recovering passwords of RAR 3.0 archives with unencrypted file names (i.e., when opening the archive we see the file names but cannot extract them without the password).

crark-hp.exe is a console utility for recovering passwords of RAR 3.0 archives with the entire archive encrypted (i.e., when opening the archive we see neither the file names nor the files themselves, and cannot extract them without the password).

password.def is simply a renamed text file with very little content (for example, line 1: ## and line 2: ?*, in which case the password will be brute-forced using all characters). password.def drives the cRark program: the file contains the rules for cracking the password (the set of characters that crark.exe will use in its work). More details about choosing these character sets are given in the russian.def text file, which can be downloaded from the website of the author of cRark.

Preparation

I'll say right away that the program only works if your video card is based on a GPU that supports CUDA compute capability 1.1. So video cards based on the G80 chip, such as the GeForce 8800 GTX, will not do, since they only have hardware support for CUDA 1.0. Using CUDA, the program selects passwords only for RAR archives of version 3.0 and later. All CUDA-related software also needs to be installed.

We create a folder anywhere (for example on the C: drive) and give it any name, for example "3.2". We place the files crark.exe, crark-hp.exe and password.def there, along with the password-protected/encrypted RAR archive.

Next, launch the Windows command-line console and change to the created folder. In Windows Vista and 7, open the "Start" menu and type "cmd.exe" in the search field; in Windows XP, open the "Run" dialog from the "Start" menu and enter "cmd.exe" there. After opening the console, enter a command of the form cd C:\folder\ - in this case, cd C:\3.2.

We type two lines in a text editor (you can also save the text as a .bat file in the folder with cRark) to guess the password for a password-protected RAR archive with unencrypted files:

echo off;
cmd /K crark (archive name).rar

to guess the password of a password-protected and encrypted RAR archive:

echo off;
cmd /K crark-hp (archive name).rar

Copy 2 lines of the text file to the console and press Enter (or run the .bat file).

Results

The decryption process is shown in the figure:

The speed of guessing on cRark using CUDA was 1625 passwords per second. In one minute and thirty-six seconds, a 3-character password was found: "q)$". For comparison, the search speed of Advanced Archive Password Recovery on my dual-core Athlon 3000+ processor is at most 50 passwords per second, so the search would have taken about 5 hours. In other words, brute-forcing a RAR archive password in cRark on a GeForce 9800 GTX+ video card is about 30 times faster than on the CPU.
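The quoted factor is simply the ratio of the two measured rates:

\frac{1625 \ \text{passwords/s}}{50 \ \text{passwords/s}} = 32.5 \approx 30\times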

For those with an Intel processor and a good motherboard with a high system bus frequency (FSB 1600 MHz), CPU throughput and search speed will be higher. And with a quad-core processor and a pair of GeForce 280 GTX-class video cards, password brute-forcing is sped up many times over. To summarize the example: this problem was solved using CUDA technology in about 2 minutes instead of 5 hours, which shows the high potential of the technology!

Conclusions

Having examined CUDA, a technology for parallel computing, we clearly saw its power and enormous development potential using the example of a password-recovery program for RAR archives. This technology will surely find a place in the life of anyone who decides to use it, whether for scientific tasks, video processing, or economic problems that demand fast and accurate calculation - all of which leads to an inevitable increase in productivity that cannot be ignored. Today the phrase "home supercomputer" is already entering the lexicon, and it is clear that to turn it into reality every home already has a tool called CUDA. Since the release of cards based on the G80 chip (2006), a huge number of NVIDIA accelerators supporting CUDA have been released, which can make the dream of a supercomputer in every home come true. By promoting CUDA, NVIDIA also raises its standing in the eyes of customers by providing additional capabilities in hardware many of them have already purchased. One can only expect that CUDA will continue to develop rapidly and allow users to take full advantage of parallel computing on the GPU.






