Parallel and sequential computing. Types of parallel computing

Useful tips

Typically, a parallel program algorithm is a sequence of parallel and sequential sections. The parallel part of the program includes data distribution and data exchange, according to schemes determined by the parallel algorithm of the program. The sequential part of the program, as a rule, determines the arithmetic processing of data on all or individual processes.

Amdahl's Law- illustrates the limitation of the growth of computing system performance with an increase in the number of computers.

“When a task is divided into several parts, the total execution time on a parallel system cannot be less than the execution time of the longest fragment.” According to this law, speeding up program execution due to parallelization its instructions on multiple computers are limited by the time required to execute its successive instructions.

Let it be necessary to solve some computational problem. Let's assume that its algorithm is such that a fraction of the total computation can be obtained only by sequential calculations, and, accordingly, the fraction can be ideally parallelized (that is, the computation time will be inversely proportional to the number of nodes involved). Then the acceleration that can be obtained on a computing system consisting of processors, compared to a single-processor solution, will not exceed

Let the processors be homogeneous in performance. T 0 – execution time of the sequential part of a parallel algorithm, for example, generating initial data and processing the result obtained when solving a problem. T 1 , T 2 , ... T p– time of sequential work performed by each processor without interaction with each other. Then the task execution time is p processors is determined by the inequality:

where i=1, 2, ... p. Equality occurs when Ti equal to each other. From here, substituting
T seq – T 0 , where T seq is the execution time of a task on one processor, we get T par ≥ T 0
.

Divide by T seq and, denoting by f = T 0 / T seq - the fraction (fraction) of the sequential section in the total volume of calculations, we obtain:
.(1.1)

Acceleration (Speedup) is the ratio of the task execution time in sequential mode (on 1 processor) to the task execution time in parallel mode (on p processors).

, using inequality (1.1), we obtain
(1.2)

From this we can see that for f=0 and equality T i we get S=p, for f >0 and p → ∞ we get
. This function is monotonically increasing in p and, therefore, reaches a maximum at infinity. Therefore, on no number of processors counting acceleration cannot exceed the reciprocal of the fraction of a sequential section.

Considering Amdahl's law, we assumed that the proportion of sequential calculations f is a constant value and does not depend on the parameter n, determining the computational complexity of the problem being solved. However, for a large number of problems the fraction f=f(n) is a decreasing function of n, in which case the speedup for a fixed number of processors can be increased by reducing the proportion of sequential work performed by each processor. In other words, the acceleration Sp= Sp(n) is an increasing function of the parameter n (this statement is often called the Amdahl effect).

Parallelization efficiency -This is the ability of the algorithm to use all processors involved in the task at 100%. Formula for calculating efficiency:

(1.2)

Those. if the acceleration S = p (the maximum possible on a p processor machine), then the efficiency of task parallelization is 100%. Using Amdahl's law we obtain an upper estimate of efficiency:

E≤ 100%
(1.3)

For example, E ≤ 52.25% for p=100 and f=0.01 and E ≤ 9.1% for p=1000 and f=0.01.

Conclusion . With a small proportion of sequential work, an increase in the number of processes leads to a deterioration in parallel efficiency (the reason is that as processes grow, the number of exchanges increases). For example, if f=0.01 (1%), then E<100 и использовать для решения параллельной задачи более 100 процессоров нецелесообразно. To improve efficiency, as a rule, do not parallelize control parts of the program or small sections of calculations that require intensive synchronization of processes.

To evaluate acceleration, another characteristic is considered, which is called faster scaling(scaled speedup). This assessment can show how efficiently parallel computing can be organized as the complexity of the problems being solved increases.

Scaling(scalable) is the ability of a parallel algorithm to efficiently use processors while increasing the complexity of calculations. A problem is scalable if, as the number of processors increases, the algorithm provides a proportional increase in speedup while maintaining a constant level of processor efficiency.

Scalability– this is a proportional increase in the volume of the problem with an increase in the number of processors used to solve it. The presence of task scalability is an important property of test systems for assessing the performance of parallel computing systems.

Poor scalability of a parallel application on an MPP system can be caused by a) an increase in communication costs with an increase in the number of processors used; b) uneven distribution of computing load between processors.

As the number of processors increases while maintaining the problem size, the total number of MPI function calls in the program increases. At the same time, the overhead for generating and sending messages increases, and the amount of calculations per processor decreases, which causes a decrease in the efficiency of parallelization. In the face of an increased number of messages, network latency will have an increasingly negative impact. For clusters whose nodes are symmetric multiprocessors, you can try to reduce the cost of communications by replacing multiprocessing within each node with multithreaded processing.

Let's estimate the total overhead that occurs when executing a parallel algorithm T 0 = P*Tp − T 1 , Where T 1 - execution time of the sequential algorithm of the task, T p- execution time of the task algorithm on P processors.

Overhead costs arise due to the need to organize interaction between processors, synchronize parallel calculations, etc.

Using the introduced notation, we can obtain new expressions for the time of parallel solution of the problem and the corresponding acceleration:

Tp = (T 1 +T 0 )/P , Sp = T 1 / Tp = (P* T 1 )/(T 1 +T 0 )

Then the efficiency of using processors can be expressed as

E P = Sp/P = T 1 /(T 1 +T 0 ) = 1/(1+ T 1 /T 0 )

Then, if the complexity of the problem being solved is fixed ( T 1 =const), then as the number of processors increases, efficiency will, as a rule, decrease due to increased overhead costs T 0 . With a fixed number of processors, efficiency can be improved by increasing the complexity of the problem being solved T 1 , since it is assumed that as complexity increases, the overhead T 0 growing slower than computing volume T 1 .

Thus, with an increase in the number of processors, in most cases it is possible to achieve a certain level of efficiency by means of a corresponding increase in the complexity of the tasks being solved. In this regard, an important characteristic of parallel computing is the ratio of the required growth rate of calculation complexity and the number of processors used.

Another important characteristic of the developed algorithms is price (cost ) calculations, defined as the product of the time to solve a problem in parallel and the number of processors used.

2. Data network topology. Examples of elementary topologies, main characteristics. Routing algorithms and data transmission methods.

When organizing parallel computing in multicomputers, data transfer between processors of the computing environment is used to organize interaction, synchronization and mutual exclusion of parallel processes. Time delays when transmitting data over communication lines can be significant (compared to the speed of processors) and, as a result, the communication complexity of the algorithm has a significant impact on the choice of parallel methods for solving problems.

Examples of data network topologies

The topology of a data transmission network is the structure of switching lines between processors of a computer system. The topology is a complete graph in which data transfer can be organized between any two vertices (network processors). The topology is determined taking into account the possibilities of effective technical implementation based on an analysis of the intensity of transmission of information flows. Typical topologies usually include the following processor communication schemes (see figure).

Complete graph(completely-connected graph or clique) - a system in which there is a direct communication line between any pair of processors, so this topology provides minimal costs for data transfer, but is difficult to implement with a large number of processors.

Ruler(linear array or farm) – a system in which all processors are renumbered in order and each processor, except the first and last, has communication lines only with two neighboring (previous and subsequent) processors; such a scheme is, on the one hand, simply implementable, and on the other hand, it corresponds to the structure of data transfer when solving many computing problems (for example, when organizing pipeline calculations).

Ring(ring) – this topology is obtained from a line of processors by connecting the first and last processors of the line.

Star(star) – a system in which all processors have communication lines with some control processor; this topology is effective, for example, when organizing centralized parallel computing schemes.

Lattice(mesh) – a system in which the graph of communication lines forms a rectangular mesh (usually two- or three-dimensional); such a topology can be quite simply implemented and, moreover, can be effectively used in the parallel execution of many numerical algorithms (for example, when implementing methods for analyzing mathematical models described by partial differential equations).

Hypercube(hypercube) – this topology represents a special case of a grid structure, when there are only two processors for each grid dimension; This option for organizing a data transmission network is quite widespread in practice and is characterized by the following number of distinctive features:

a) two processors have a connection if the binary representations of their numbers have

only one different position;

b) an N-dimensional hypercube can be divided into two (N-1)-dimensional hypercubes (a total of N different partitions are possible);

c) the shortest path between any two processors has a length that matches the number of different bit values in the processor numbers (this value is known as the Hamming distance).

Because Since each processor can take part in only one data transmission and reception operation, only those communication operations in which the interacting pairs of processors do not intersect can be performed in parallel.

Transcript

1 Part 3. Methods of parallel computing 6. Principles of developing parallel methods 6. Principles of developing parallel methods Modeling parallel programs Stages of developing parallel algorithms Dividing calculations into independent parts Isolating information dependencies Scaling a set of subtasks Distributing subtasks between processors Parallel solution of the N-body gravitational problem Dividing calculations into independent parts Isolation of information dependencies Scaling and distribution of subtasks among processors Analysis of the efficiency of parallel computing Brief overview of the section Literature review Test questions Tasks and exercises The development of algorithms (and especially parallel computing methods) for solving complex scientific and technical problems is often a significant problem. To reduce the complexity of the topic under consideration, we will leave aside the mathematical aspects of the development and proof of convergence of algorithms; these issues are studied to one degree or another in a number of “classical” mathematical training courses. Here we will assume that computational schemes for solving the problems considered below as examples are already known 1). Taking into account the above assumptions, subsequent actions to determine effective ways to organize parallel computing may be as follows: Perform an analysis of existing computing schemes and divide them (decomposition) into parts (subtasks) that can be implemented largely independently of each other, Select for formed set of subtasks, information interactions that must be carried out in the course of solving the original task. Determine the computing system necessary (or available) to solve the problem and distribute the existing set of subtasks between the system processors. At the most general consideration, it is clear that the amount of computation for each processor used should be approximately the same; this will ensure uniform computational load (balancing) of the processors. In addition, it is also clear that the distribution of subtasks between processors must be done in such a way that the presence of information links (communication interactions) between subtasks is minimal. 1) Despite the fact that for many scientific and technical problems not only sequential but also parallel solution methods are actually known, this assumption is, of course, very strong, since for new emerging problems that require a large amount of calculations for their solution, The process of developing algorithms constitutes a significant part of all work performed.

2 Division of calculations into independent parts Isolation of information dependencies Scaling of subtasks Distribution of subtasks between processors Fig. General scheme for the development of parallel algorithms After completing all of the listed design stages, you can evaluate the effectiveness of the developed parallel methods; for this, the values of the quality indicators of the generated parallel calculations are usually determined (acceleration, efficiency, scalability) . Based on the results of the analysis, it may be necessary to repeat individual (in the extreme case all) stages of development, it should be noted that a return to previous development steps can occur at any stage of the design of parallel computing circuits. In this regard, a frequently performed additional action in the above design scheme is to adjust the composition of the generated set of tasks after determining the available number of processors; subtasks can be enlarged (aggregated) if there is a small number of processors or, conversely, detailed otherwise. In general, these actions can be defined as scaling the developed algorithm and identified as a separate stage in the design of parallel computing. To apply the ultimately obtained parallel method, it is necessary to develop programs to solve the generated set of subtasks and place the developed programs on processors in accordance with the selected subtask distribution scheme. To carry out calculations, programs are launched for execution (programs at the execution stage are usually called processes); to implement information interactions, programs must have at their disposal means of data exchange (message transmission channels). It should be noted that each processor is usually allocated to solve one single subtask, however, if there are a large number of subtasks or using a limited number of processors, this rule may not be observed and, as a result, several programs (processes) can be executed simultaneously on the processors. In particular, when developing and initially testing a parallel program, one processor can be used to execute all processes (when located on one processor, processes are executed in a time-shared mode). Having carefully examined the carefully developed scheme for the design and implementation of parallel computing, it can be noted that this approach is largely focused on computing systems with distributed memory, when the necessary information interactions are implemented by passing messages over communication channels between processors. However, this scheme can be used without losing any efficiency of parallel computing and for developing parallel methods for systems with shared memory; in this case, message passing mechanisms to ensure information interactions must be replaced by access operations to common (shared) variables Modeling parallel programs The considered scheme for the design and implementation of parallel computing provides a way to understand parallel algorithms and programs. At the design stage, a parallel method can be represented in the form of a “message subtask” graph, which is nothing more than a large (aggregated) representation of the information dependency graph (the “operations-operands” graph, see Section 2). Similarly, at the execution stage, to describe a parallel program, a model in the form of a “process channels” graph can be used, in which the concept of processes is used instead of subtasks, and information dependencies are replaced by channels 2

3 message transmissions. In addition, this model can show the distribution of processes among the processors of a computer system if the number of subtasks exceeds the number of processors, see Fig. process - channel - receive (transmit) operations - input (output) channels for interaction between processes Fig Model of a parallel program in the form of a graph "process-channels" The use of two models of parallel computing 2) makes it possible to better separate the problems that arise when developing parallel methods. The first model of the “subtasks - messages” graph allows you to focus on the issues of identifying subtasks of the same computational complexity, while ensuring a low level of information dependence between subtasks. The second model of the “process channels” graph focuses on the issues of distributing subtasks across processors, providing another opportunity to reduce the complexity of information interactions between subtasks by placing intensively interacting processes on the same processors. In addition, this model allows us to better analyze the effectiveness of the developed parallel method and provides the possibility of a more adequate description of the process of performing parallel calculations. Let us give additional explanations for the concepts used in the “process-channels” model: In the framework of this educational material, by process we mean a program executed on the processor, which uses part of the local memory of the processor for its work and which contains a number of operations for receiving/transmitting data to organize information interaction between executing processes of a parallel program. A data channel can be viewed from a logical point of view as a message queue into which one or more processes can send forwarded data and from which the destination process can retrieve messages sent by other processes. In general, we can assume that channels arise dynamically at the time the first receive/transmit operation is performed on the channel. In terms of generality, a channel can correspond to one or more commands for receiving data from the recipient process; similarly, when passing messages, a channel can be used by one or more data transfer commands from one or more processes. To reduce the complexity of modeling and analysis of parallel methods, we will assume that the channel capacity is unlimited and, as a result, data transfer operations are performed almost without delay by simply copying messages into the channel. On the other hand, message receiving operations may result in delays (blocking) if the data requested from the channel has not yet been sent by the message source processes. It should be noted that the important advantage of the considered “process-channel” model is that in this model there is a clear separation of local (performed on a separate processor) calculations and 2) In Foster (1995), only one model is considered, the “task-channel” model for describing parallel computing, which takes some intermediate position compared to the models presented here. Thus, the “task-channel” model does not take into account the possibility of using one processor to solve several subtasks simultaneously. 3

4 actions to organize information interaction between simultaneously running processes. This approach significantly reduces the complexity of analyzing the effectiveness of parallel methods and significantly simplifies the problems of developing parallel programs. Stages of developing parallel algorithms Let us consider in more detail the methodology for developing parallel algorithms outlined above. To a large extent, this technique relies on the approach first discussed in Foster (1995), and, as noted earlier, includes the stages of identifying subtasks, determining information dependencies, scaling and distributing subtasks across processors of the computer system (see Fig. 6.1). To demonstrate the recommendations given, we will further use the educational problem of finding the maximum value among the elements of matrix A (such a problem arises, for example, when numerically solving systems of linear equations to determine the leading element of the Gaussian method): y = max a. 1 i, j N i j This problem is purely illustrative, and after reviewing the development steps, the remainder of this section will provide a more complete example of using this technique to develop parallel algorithms. In addition, this development scheme will be applied when presenting all of the parallel computing methods considered below. Dividing calculations into independent parts. The choice of a method for dividing calculations into independent parts is based on an analysis of the computational scheme for solving the original problem. The requirements that the chosen approach must satisfy usually consist of ensuring an equal amount of computation in the allocated subtasks and a minimum of information dependencies between these subtasks (other things being equal, preference should be given to rare operations of transferring large messages over frequent transfers of small data). In the general case, carrying out analysis and identifying tasks is a rather complex problem; the situation helps to resolve the existence of two commonly encountered types of computational schemes: a) b) Fig. Data division for matrix A: a) strip scheme, b) block scheme For a large class of computation problems are reduced to performing the same type of processing of elements of a large set of data. This type of task includes, for example, matrix calculations, numerical methods for solving partial differential equations, etc. In this case, they say that there is data parallelism, and the selection of subtasks comes down to dividing the available data . So, for example, for our educational task of finding the maximum value when forming subtasks, the original matrix A can be divided into individual rows (or successive groups of rows) - a strip data separation scheme (see Fig. 6.3) or into rectangular sets of elements - a block data separation scheme. For a large number of problems being solved, dividing calculations by data leads to the generation of one-, two-, and three-dimensional sets of subproblems for which information connections exist only between nearest neighbors (such schemes are usually called meshes or lattices), 4

5 Fig. Regular one-, two- and three-dimensional structures of basic subtasks after data decomposition For another part of the problems, calculations may consist of performing different operations on the same set of data; in this case, they speak about the existence of functional parallelism (examples include tasks of processing a sequence of queries to information databases, calculations with the simultaneous use of different calculation algorithms, etc.). Very often, functional decomposition can be used to organize pipeline data processing (for example, when performing any data transformations, calculations can be reduced to a functional sequence of data input, processing and saving). An important issue when identifying subtasks is choosing the desired level of decomposition of calculations. Forming the maximum possible number of subtasks ensures the use of the maximum achievable level of parallelism of the problem being solved, but complicates the analysis of parallel calculations. Using only fairly “large” subtasks when decomposing calculations leads to a clear parallel computing scheme, but can make it difficult to effectively use a sufficiently large number of processors. A possible reasonable combination of these two approaches may consist in using as constructive elements of decomposition only those subproblems for which parallel computing methods are known. For example, when analyzing a matrix multiplication problem, you can use vector scalar product methods or matrix-vector product algorithms as subproblems. Such an intermediate method of decomposing calculations will ensure both the simplicity of representation of computational schemes and the efficiency of parallel calculations. In this approach, we will further refer to the selected subtasks as basic ones, which can be elementary (indivisible) if they do not allow further division, or composite otherwise. For the educational task under consideration, a sufficient level of decomposition may consist, for example, in dividing the matrix A into many individual rows and obtaining on this basis a set of subtasks for finding maximum values in individual rows; the structure of information connections generated in this case corresponds to a linear graph, see Fig. To assess the correctness of the stage of dividing calculations into independent parts, you can use the checklist of questions proposed in Foster (1995): Does the decomposition performed not increase the amount of calculations and the required amount of memory? With the chosen decomposition method, is it possible to uniformly load all available processors? Are the dedicated parts of the computation process sufficient to efficiently load the existing processors (taking into account the possibility of increasing their number)? Identification of information dependencies If there is a computational scheme for solving a problem, after identifying basic subtasks, determining information dependencies between subtasks usually does not cause much difficulty. At the same time, however, it should be noted that in fact, the stages of identifying subtasks and information dependencies are quite difficult to separate. The allocation of subtasks should take into account emerging information connections; After analyzing the volume and frequency of necessary information exchanges between subtasks, it may be necessary to repeat the stage of dividing calculations. When analyzing information dependencies between subtasks, one should distinguish (preferred forms of information interaction are underlined): Local and global data transfer schemes for local data transfer schemes at each point in time are performed only between a small number of subtasks (located as 5

6 usually on neighboring processors), for global data transfer operations, all subtasks take part in the communication process. Structural and arbitrary methods of interaction For structural methods, the organization of interactions leads to the formation of some standard communication schemes (for example, in the form of a ring, a rectangular lattice, etc.). etc.), for arbitrary interaction structures, the scheme of performed data transfer operations is not homogeneous. Static or dynamic data transfer schemes for static schemes, the moments and participants of information interaction are fixed at the stages of design and development of parallel programs; for a dynamic version of interaction, the structure of the data transfer operation is determined during the calculations being performed, Synchronous and asynchronous methods of interaction for synchronous methods, data transfer operations are performed only when all participants in the interaction are ready and are completed only after the complete completion of all communication actions; when performing asynchronous operations, the participants in the interaction do not have to wait for the complete completion of data transfer actions. For the presented methods of interaction, it is quite difficult to identify the preferred forms of organizing data transfer: the synchronous option is, as a rule, easier to use, while the asynchronous method can often significantly reduce time delays caused by information interaction operations. As noted in the previous paragraph, for the educational task of searching for the maximum value when using subtasks of searching for maximum values in individual rows of the original matrix A as basic elements, the structure of information connections has the form shown in Fig. Fig. Structure of information connections of the educational task As before, for assessment To ensure the correctness of the stage of identifying information dependencies, you can use the checklist of questions proposed in Foster (1995): Does the computational complexity of subtasks correspond to the intensity of their information interactions? Is the intensity of information interactions the same for different subtasks? Is the information interaction scheme local? Does the identified information dependence prevent the parallel solution of subtasks? Scaling a set of subtasks Scaling the developed parallel computing scheme is carried out if the number of available subtasks differs from the number of processors planned for use. To reduce the number of subtasks, it is necessary to perform enlargement (aggregation) of calculations. The rules used here coincide with the recommendations of the initial stage of identifying subtasks; the defined subtasks, as before, must have the same computational complexity, and the volume and intensity of information interactions between subtasks must remain at the minimum possible level. As a result, the first candidates for merging are subtasks with a high degree of information interdependence. If the available set of subtasks is insufficient to load all processors available for use, it is necessary to perform detailed calculations (decomposition). Like 6

7 As a rule, carrying out such a decomposition does not cause any difficulties if parallel computing methods are known for basic problems. The implementation of the calculation scaling stage should ultimately come down to the development of rules for aggregation and decomposition of subtasks, which should parametrically depend on the number of processors used for calculations. For the considered educational problem of finding the maximum value, the aggregation of calculations can consist of combining individual rows into groups (ribbon scheme for dividing the matrix, see Fig. 6.3a); when decomposing subtasks, the rows of the original matrix A can be divided into several parts (blocks). The checklist suggested by Foster (1995) for assessing the correctness of the scaling step is as follows: Will computational locality deteriorate after scaling the existing set of subproblems? Do the subtasks after scaling have the same computational and communication complexity? Does the number of tasks match the number of available processors? Do scaling rules depend parametrically on the number of processors? Distributing subtasks between processors Distributing subtasks between processors is the final stage of developing a parallel method. It should be noted that control of load distribution for processors is possible only for computing systems with distributed memory; for multiprocessors (systems with shared memory), load distribution is usually performed automatically by the operating system. In addition, this stage of distributing subtasks between processors is redundant if the number of subtasks coincides with the number of available processors, and the topology of the data transmission network of the computer system is a complete graph (i.e., all processors are interconnected by direct communication lines). The main indicator of the success of this stage is the efficiency of processor utilization, defined as the relative proportion of time during which processors were used for calculations related to solving the original problem. The ways to achieve good results in this direction remain the same as before; it is necessary to ensure an even distribution of the computational load between processors and minimize the number of messages transmitted between processors. In the same way as in the previous design stages, the optimal solution to the problem of distributing subtasks between processors is based on an analysis of the information connectivity of the “subtasks - messages” graph. So, in particular, it is advisable to place subtasks between which there are information interactions on processors between which there are direct data lines. It should be noted that the requirement to minimize information exchanges between processors may contradict the condition of uniform process load. So, we can place all subtasks on one processor and completely eliminate interprocessor message passing, however, of course, the load on most processors in this case will be minimal. For our educational task of searching for the maximum value, the distribution of subtasks between processors does not cause any difficulties; it is only necessary to ensure that the subtasks between which there are information connections are placed on processors for which there are direct data transfer channels. Since the structure of information connections of an educational task has the form of a linear graph, the fulfillment of this requirement can be ensured for almost any computer system network topology. Solving the issues of balancing the computing load becomes much more complicated if the calculation scheme can change during the solution of the problem. The reason for this may be, for example, inhomogeneous grids when solving partial differential equations, sparseness of matrices, etc. 3). In addition, estimates of the computational complexity of solving subproblems used at the design stages can be approximate and, finally, the number of subproblems can change during the calculations. In such situations, it may be necessary to redistribute the basic subtasks between 3) It can be noted that even for our simple training task, different computational complexity of the generated basic tasks can be observed. So, for example, the number of operations when searching for the maximum value for a row in which the first element has the maximum value and a row in which the values are ordered in ascending order will differ by a factor of two. 7

8 processors already directly in the process of executing a parallel program (or, as they usually say, you will have to perform dynamic balancing of the computing load). These issues are among the most difficult (and most interesting) in the field of parallel computing, unfortunately, consideration of these issues is beyond the scope of this tutorial (additional information can be obtained, for example, Buyya (1999) and Wilkinson and Allen (1999)) . As an example, we will give a brief description of a widely used method of dynamically managing the distribution of computing load, usually called the manager-worker scheme. When using this approach, it is assumed that subtasks can arise and be completed during calculations, while information interactions between subtasks are either completely absent or minimal. In accordance with the scheme under consideration, a separate processor-manager is allocated to manage load distribution in the system, which has access to information about all available subtasks. The remaining processors of the system are executors that contact the manager processor to receive the computational load. New subtasks generated during calculations are transferred back to the manager processor and can be obtained for solution during subsequent calls by executing processors. Completion of calculations occurs at the moment when the executing processors have completed the solution of all subtasks transferred to them, and the manager processor does not have any computational work to perform. Foster's (1995) checklist for testing the subtask allocation phase is as follows: Does allocating multiple tasks to a single processor introduce additional computational costs? Is there a need for dynamic balancing of calculations? Isn't the processor-manager a bottleneck when using the manager-executor scheme? 6.3. Parallel solution of the N-body gravitational problem Many computational problems in the field of physics come down to data processing operations for each pair of objects in the existing physical system. Such a problem is, in particular, the problem widely known in the literature as the N-body gravitational problem (or simply the N-body problem), see, for example, Andrews (2000). In its most general form, the problem can be described as follows. Let a large number of bodies (planets, stars, etc.) be given. ), for each of which the mass, initial position and speed are known. Under the influence of gravity, the position of the bodies changes, and the required solution to the problem is to simulate the dynamics of changes in a system of N bodies over a certain specified time interval. To carry out such a simulation, a given time interval is usually divided into time intervals of short duration and then at each modeling step the forces acting on each body are calculated, and then the velocities and positions of the bodies are updated. An obvious algorithm for solving the N-body problem is to consider at each modeling step all pairs of objects of the physical system and perform all the necessary calculations for each resulting pair. As a result, with this approach, the execution time for one simulation iteration will be 4) T = τ N(N 1) / 2, 1 where τ is the time for recalculating the parameters of one pair of bodies. As follows from the above description, the computational scheme of the considered algorithm is relatively simple, which makes it possible to use the N-body problem as another clear demonstration of the application of the methodology for developing parallel algorithms. 4) It should be noted that there are more efficient sequential algorithms for solving the N-body problem, but studying them may require quite a lot of effort. Taking into account this circumstance, this “obvious” (but not the fastest) method is chosen for further consideration, although, in the general case, of course, the best calculation schemes should be chosen for parallelization. 8

9 Dividing calculations into independent parts Choosing a method for dividing calculations does not cause any difficulties - the obvious approach is to select as a basic subtask the entire set of calculations associated with processing data from one body of a physical system. Isolating information dependencies Performing calculations associated with each subtask, becomes possible only if the subtasks contain data (position and speed of movement) about all bodies of the existing physical system. As a result, before each simulation iteration begins, each subtask must obtain all necessary information from all other subtasks of the system. This data transfer procedure, as noted in Section 3, is called a data collection operation (single-node gather). In the algorithm under consideration, this operation must be performed for each subtask; this data transfer option is usually referred to as a generalized data collection operation (multi-node gather or all gather). Determining the requirements for the necessary results of information exchange does not lead to an unambiguous establishment of the necessary information exchange between subtasks; achieving the required results can be achieved using different algorithms for performing a generalized data collection operation. The simplest way to perform the necessary information exchange is to implement a sequence of steps, at each of which all existing subtasks are divided into pairs and data exchange is carried out between the subtasks of the resulting pairs. With proper organization of pairwise division of subtasks, (N-1) repetition of the described actions will lead to the complete implementation of the required data collection operation. The method of organizing information exchange discussed above is quite labor-intensive; collecting all the necessary data requires (N-1) iterations, each of which simultaneously performs (N/2) data transfer operations. To reduce the required number of iterations, you can pay attention to the fact that after completing the first step of the data collection operation, the subtasks will already contain not only their own data, but also the data of the subtasks with which they formed pairs. As a result, at the second iteration of data collection, it will be possible to form pairs of subtasks to exchange data about two bodies of the physical system at once, thereby, after completion of the second iteration, each subtask will contain information about four bodies of the system, etc. As you can see, this method of implementing exchanges allows you to complete the necessary procedure in log 2 N iterations. It should be noted that the volume of data sent in each exchange operation doubles from iteration to iteration; at the first iteration, data about one body of the system is sent between subtasks, at the second iteration about two bodies, etc. Using the considered method of implementing the operation of generalized data collection leads to determining the structure of information connections between subtasks in the form of an N-dimensional hypercube. Scaling and distribution of subtasks among processors As a rule, the number of bodies of a physical system N significantly exceeds the number of processors p. As a result, the previously considered subtasks should be enlarged by combining calculations for a group of (N/p) bodies within one subtask. After such aggregation, the number of subtasks and the number of processors will coincide, and when distributing subtasks between processors, all that remains is to ensure the presence of direct communication lines between processors with subtasks, between which there is information exchange during the data collection operation Analysis of the efficiency of parallel computing Let us evaluate the effectiveness of the developed methods of parallel computing to solve the N-body problem. Since the proposed options differ only in the methods of performing information exchanges, to compare approaches it is enough to determine the duration of the generalized data collection operation. We use the model proposed by Hockney to estimate the message transmission time (see Section 3), then the duration of the data collection operation for the first version of parallel computing can be expressed as 1 T p (comm) = (p 1)(α + m (N / p) / β), where α, β are the parameters of the Hockney model (latency and bandwidth of the data transmission network), and m specifies the amount of data sent for one body of the physical system. 9

10 For the second method of information exchange, as noted earlier, the volume of data sent at different iterations of the data collection operation varies. At the first iteration, the volume of messages sent is (mn/p), at the second iteration this volume doubles and turns out to be equal to 2(mN/p), etc. In general, for iteration number i, the volume of messages is estimated as 2 i-1 (mn/p). As a result, the duration of the data collection operation in this case can be determined using the following expression T 2 p log p i= 1 i 1 (comm) = (α + 2 m(N / p) / β) = α log p + m (N / p)(p 1) / β. A comparison of the obtained expressions shows that the second developed method of parallel computing has significantly higher efficiency, incurs lower communication costs and allows for better scalability with an increase in the number of processors used. Brief overview of the section The section examined the methodology for developing parallel algorithms proposed in Foster (1995). This technique includes the stages of identifying subtasks, determining information dependencies, scaling and distributing subtasks across processors of a computer system. When applying the technique, it is assumed that the computational scheme for solving the problem under consideration is already known. The main requirements that must be met when developing parallel algorithms are to ensure uniform load on processors with low information interaction between the formed set of subtasks. To describe the computational parallel circuits obtained during the development, two models are considered. The first of these, the “subtask-message” model can be used at the stage of designing parallel algorithms, the second model “processes-channels” can be used at the stage of implementing methods in the form of parallel programs. At the end of the section, the application of the considered methodology for developing parallel algorithms is shown using the example of solving the N-body gravitational problem. Literature review The methodology for developing parallel algorithms discussed in the section was first proposed in Foster (1995). In this work, the methodology is presented in more detail; In addition, the work contains several examples of using the technique to develop parallel methods for solving a number of computational problems. Quinn (2004) may also be useful when considering the design and development of parallel algorithms. The N-body gravitational problem is discussed in more detail in Andrews (2000) Test Questions 1. What are the initial assumptions for the possibility of applying the methodology for developing parallel algorithms discussed in this section? 2. What are the main stages in the design and development of parallel computing methods? 3. How is the subtask-message model defined? 4. How is the process-channel model defined? 5. What basic requirements must be met when developing parallel algorithms? 6. What are the main actions at the stage of identifying subtasks? 7. What are the main actions at the stage of determining information dependencies? 8. What are the main actions at the stage of scaling the existing set of subtasks? 9. What are the main actions at the stage of distributing subtasks among the processors of a computer system? 10. How is the distribution of computing load dynamically controlled using the “manager-executor” scheme? 11. What parallel computing method was developed to solve the N-body gravitational problem? 10

11 12. Which way of performing a general data collection operation is more efficient? 6.7. Tasks and exercises 1. Implement a cascade scheme for calculating the sum of a sequence of numerical values (see Section 2) and compare the execution time of the completed implementation and the MPI_Bcast function of the MPI library. 2. Implement the considered methods for performing a generalized data collection operation and compare their execution time. Compare the resulting time characteristics with the theoretical estimates. Compare with the execution time of the MPI library function MPI_Allgather. 3. Develop a parallel computing scheme using the methodology for designing and developing parallel methods discussed in the section: for the problem of finding the maximum value among the minimum elements of the rows of a matrix (this problem occurs for solving matrix games) y = max min a, 1 i N 1 j N ij (pay special attention to the situation when the number of processors exceeds the order of the matrix, i.e. p>n), for the problem of calculating the definite integral using the rectangle method b N 1 y = f (x) dx h fi, a i= 0 f i = f (x), x = i h, h = (b a) / N. i i (a description of integration methods is given, for example, in Kahaner, Moler and Nash (1988)) 4. Implement the developed parallel methods for problems n Develop a parallel scheme calculations for the problem of matrix-vector multiplication, using the methodology for designing and developing parallel methods discussed in the section. References Andrews, G. R. (2000). Foundations of Multithreaded, Parallel, and Distributed Programming.. Reading, MA: Addison-Wesley (Russian translation by Andrews G.R. Fundamentals of multithreaded, parallel and distributed programming. M.: Williams Publishing House, 2003) Bertsekas, D.P., Tsitsiklis , J.N. (1989) Parallel and distributed computing. Numerical Methods. - Prentice Hall, Englewood Cliffs, New Jersey. Buyya, R. (Ed.) (1999). High Performance Cluster Computing. Volume1: Architectures and Systems. Volume 2: Programming and Applications. - Prentice Hall PTR, Prentice-Hall Inc. Kahaner, D., Moler, C., Nash, S. (1988). Numerical Methods and Software. Prentice Hall (Russian translation Kahaner D., Mowler L., Nash S. Numerical methods and software. M.: Mir, 2001) Foster, I. (1995). Designing and Building Parallel Programs: Concepts and Tools for Software Engineering. Reading, MA: Addison-Wesley. Quinn, M. J. (2004). Parallel Programming in C with MPI and OpenMP. New York, NY: McGraw-Hill. Wilkinson, B., Allen, M. (1999). Parallel programming. Prenrice Hall. eleven

CHAPTER 3 PRINCIPLES OF DEVELOPMENT OF PARALLEL METHODS The development of algorithms (and especially parallel computing methods) for solving complex scientific and technical problems is often a significant

Methods and algorithms of parallel computing Design of parallel algorithms Kulakov Kirill Aleksandrovich 2016 Petrozavodsk Design goals Load balancing Scalability Efficiency

High-performance computing Lecture 2. Estimating the maximum possible parallelism Providing the best speedup S T = efficiency E = 1 is not possible for all computationally T labor-intensive

Lectures Lecture 1. Principles of constructing parallel computing systems.................................. 23 Lecture 2. Modeling and analysis of parallel computing.. .... 49 Lecture 3. Assessing communication

Nizhny Novgorod State University named after. N.I. Lobachevsky Faculty of Computational Mathematics and Cybernetics Educational complex Introduction to parallel programming methods Section 9. Parallel

Project of the Presidential Commission on modernization and technological development of the Russian economy “Creation of a system for training highly qualified personnel in the field of supercomputer technologies and specialized

Topic: Parallelization of expressions using arithmetic as an example. Basic characteristics of complexity and parallelism. What can be parallelized? Problem (decomposition into subproblems of smaller dimension) 2Method

QUESTIONS FOR THE TEST IN THE COURSE “PARALLEL COMPUTING SYSTEMS” 1. Principles of constructing parallel computing systems (15) 1. Schemes of multiprocessor systems with homogeneous and heterogeneous access. 2.

Design of parallel algorithms Lecture 3.1 03/29/2012 T.Yu.Lymar 1 3.1 Design methodology Separation Establishing connections Aggregation Linking to a specific computer 03/29/2012 T.Yu.Lymar 2 3.1.1

Moscow State University named after. M.V. Lomonosov History and methodology of parallel programming 9. Design of parallel algorithms Developers: L.B. Sokolinsky, Doctor of Physical and Mathematical Sciences, Professor

Federal Agency for Education Nizhny Novgorod State University named after. N.I. Lobachevsky National Project “Education” Innovative educational program of UNN. Educational and scientific

Nizhny Novgorod State University named after. N.I. Lobachevsky Faculty of Computational Mathematics and Cybernetics Department of Computer Software Laboratory “Information Technologies” ItLab Practical

Nizhny Novgorod State University named after. N.I. Lobachevsky - National Research University - Lecture. Modeling of parallel computing Gergel V.P., Dean of the Computer Science and Computing Center of UNN Supercomputers

Algorithms for parallel computing systems 1. Types of parallelism and methods for synthesizing parallel algorithms. 2. Evaluating the effectiveness of parallel algorithms. 1. Types of parallelism and methods of parallel synthesis

SUPERCOMPUTER CONSORTIUM OF RUSSIAN UNIVERSITIES Project Creation of a system for training highly qualified personnel in the field of supercomputer technologies and specialized software

Evaluating the efficiency of parallel algorithms Lecture 4. 03/29/2012 T.Yu. Lymar 1 Introduction A fundamental point in the development of parallel algorithms is the analysis of the efficiency of using parallelism:

Evaluating the efficiency of parallel algorithms Lecture 7 T.Yu. Lymar A fundamental point in the development of parallel algorithms is the analysis of the efficiency of using parallelism: Assessing the maximum possible

BASIC CONCEPTS OF PARALLEL COMPUTING Parallel computing (parallel processing) is the use of several or many computing devices to simultaneously execute different parts of one

Mathematical models and methods for the effective use of distributed Digital computing 3D medicine systems Title Results Subtitle in the field of computer graphics and geometric presentation

UDC 681.5 PARALLEL ALGORITHMS FOR NUMERICAL SOLUTION OF THE CAUCHA PROBLEM FOR SODA Nazarova I.A. Donetsk National Technical University Parallel numerical algorithms of one-step methods for

CHAPTER MODELING AND ANALYSIS OF PARALLEL COMPUTING When developing parallel algorithms for solving complex scientific and technical problems, a fundamental point is to analyze the efficiency of using parallelism,

1. Goals and objectives of the discipline: Supercomputer technologies and high-performance computing using multiprocessor computing systems (MCS) are becoming an important factor in scientific and technical

Construction of statistical models of the efficiency of parallel programs V.N. Beletsky, S.A. Reznikova, A.A. Chemeris Institute for Modeling Problems in Energy. G.E. Pukhov NAS of Ukraine The article discusses

Informatics, management, economics MIPT PROCEEDINGS 2 Volume 2, (5) UDC 59687+475 AS Khritankov Moscow Institute of Physics and Technology (State University) Mathematical model of performance characteristics

ALGORITHMS FOR BALANCING LOAD OF PROCESSORS OF A PARALLEL COMPUTING SYSTEM Belkov D.V. Donetsk National Technical University, Donetsk Department of Computational Mathematics and Programming

Computers and software UDC 681.3.06 P.A. Pavlov EFFICIENCY OF DISTRIBUTED COMPUTING IN SCALABLE SYSTEMS Scalability is one of the most important requirements

Ritz method There are two main types of methods for solving variational problems. The first type includes methods that reduce the original problem to solving differential equations. These methods are very well developed

DIAGONAL METHOD FOR MULTIPLICATION OF DENSE MATRICES Knyazkova T.V., Ph.D., Associate Professor, Vyatka State University, Kirov Today, with the growing power of computing systems and modern supercomputers in a wide range of economic sectors

Introduction 1 Chapter 1 Tasks 1.1 Warm-up The first task for writing a program using the MPI library is one for everyone. 1.1.1 Calculating the number π Calculate the number π using the following formula: 1 1 dx 4 1 + x

Laboratory work 4 Parallel implementation of the Jacobi method in a three-dimensional domain Goal of work: practical development of methods for parallelizing numerical algorithms on regular grids using an implementation example

R. I. Idrisov TIME DEVELOPMENT OF THE INTERNAL REPRESENTATION IR2 OF THE SISAL 3.1 LANGUAGE * Today, the increase in computing power is no longer associated with the acceleration of an individual, but with the addition of additional

Strategy for optimization research and methods for solving problems of static and dynamic optimization of technological objects. Problems of static optimization of technological objects are traditionally formulated

ORGANIZATION OF PARALLEL GRAIN COMPUTING PROCESSES (Obtaining parallel sequences of granular calculations) Let us give examples of obtaining parallel algorithms, the sets of operations of which

PARALLEL PROPERTIES OF THE ALGORITHM Parallel computers (supercomputers) are designed to quickly solve large problems. The more powerful the computer, the faster a problem can potentially be solved on it. Besides

Kalyaev A.V. PROGRAMMING VIRTUAL ARCHITECTURES AND ORGANIZATION OF STRUCTURAL-PROCEDURAL COMPUTING IN MULTIPROCESSOR SYSTEMS WITH MASSIVE PARALLELISM 1 Abstract of the Research Institute of Multiprocessor Computing

Algorithms for parallel matrix multiplication 1 Strip algorithms for matrix multiplication In these algorithms, matrices are divided into continuous sequences of rows or columns (strips). In the simplest case

Memory allocation Memory allocation is the process by which individual elements of the source program are assigned the address, size and attributes of the memory area required to accommodate

SOLUTION OF NONLINEAR EQUATIONS AND SYSTEMS OF NONLINEAR EQUATIONS.. SOLUTION OF NONLINEAR EQUATIONS of the form Numerical solution of nonlinear algebraic or transcendental equations. is to find the values

“Algebra and Geometry” 13. Systems of linear algebraic equations (SLAE). Kronecker-Capelli theorem. General and particular solutions of SLAE. 14. Second order curves: ellipse, hyperbola, parabola, and their properties.

UDC 681.32 INCREASING THE PRODUCTIVITY OF WORKSTATION CLUSTERS USING FAN DISTRIBUTION OF ADDITIONAL TASKS FOR Idle EQUIPMENT 2012 V. M. Dovgal 1, S. G. Spirin 2 1 professor

Algorithm graph and parallel computing. Internal parallelism of programs. Lecture 3 04/12/2012 (C) L.B. Sokolinsky 1 3.1 Internal parallelism A program contains parallelism if some of its parts (operators)

MINISTRY OF EDUCATION AND SCIENCE OF THE RUSSIAN FEDERAL STATE BUDGET EDUCATIONAL INSTITUTION OF HIGHER PROFESSIONAL EDUCATION “SAMARA STATE AEROSPACE UNIVERSITY NAMED AFTER ACADEMICIAN S.P. KOROLEV

Lecture 5 5 Theorem for the existence and uniqueness of a solution to the Cauchy problem for a normal ODE system Statement of the problem The Cauchy problem for a normal ODE system x = f (, x), () consists of finding a solution x =

Chapter 5. METHODS OF IMPLICIT SURVEY Let us consider the general formulation of the discrete optimization problem mif (x), (5.) x D in which the -dimensional sought vector x belongs to a finite set of feasible solutions D.

CONTENTS Introduction.... 12 P a r t I. Fundamentals of parallelization Lecture 1. On the formulation of the parallelization problem... 17 1.1. Introduction.... 17 1.2. About some computational problems.... 19 1.3. Numerical

UDC 68.3.06 DETERMINATION OF THE NUMBER AND LOCATION TOPOLOGY OF STATIONS OF A MULTIPROCESSOR COMPUTING SYSTEM A.V. Pogrebnoy Institute "Cybernetic Center" TPU E-mail: [email protected] Problems considered

EXTRAPOLATIONAL BLOCK ONE-STEP METHODS FOR NUMERICAL HIGH-PRECISION SOLUTION OF THE CAUCHY PROBLEM Kulakov V.V. Nazarova I. A. Feldman L. P. Donetsk National Technical University Considering parallel

Proceedings of ISA RAS, 2008. V. 32 On the concept of performance in distributed computing systems M. A. Posypkin, A. S. Khritankov Institute of System Analysis of the Russian Academy of Sciences (ISA RAS) In this

2007 SCIENTIFIC BULLETIN of MSTU GA 26 series Radiophysics and radio engineering UDC 6236:6239 ASSESSMENT OF THE FEASIBILITY OF PARALLELING INFORMATION-DEPENDENT TASKS IN COMPUTING SYSTEMS RN AKINSHIN Article presented

Maximum parallelization of algorithms based on the concept of Q-determinant Valentina Nikolaevna Aleeva South Ural State University (NRU) Novosibirsk, 2015 INTRODUCTION The report discusses

Ministry of Education and Science of the Russian Federation Nizhny Novgorod State University named after. N.I. Lobachevsky V.P. Gergel HIGH-PERFORMANCE COMPUTING FOR MULTI-CORE MULTI-CORE

LC 1. Modeling. 1. Basic concepts. 2 Modeling principles. 3 Properties of models 4 Classification of modeling methods. 5. Mathematical modeling 1. BASIC CONCEPTS. Simulation substitution

Federal Agency for Education State educational institution of higher professional education "Novosibirsk State University" (NSU) Faculty of Information Technologies

Nizhny Novgorod State University named after. N.I. Lobachevsky Research University Creation of an educational library of parallel methods Parlib Completed by: Kozinov E.A. Kutlaev M.V. Osokin

UDC 681.3.06 DESIGN OF A LOCAL NETWORK STRUCTURE FOR A DISTRIBUTED REAL-TIME COMPUTING SYSTEM A.V. Pogrebnoy, D.V. Pogrebnoy Institute "Cybernetic Center" TPU E-mail: [email protected]

PARALLEL ALGORITHMS OF THE CYCLIC SWING METHOD Golovashkin D.L., Filatov M.V. Institute of Image Processing Systems RAS Samara State Aerospace University Abstract The work is dedicated to

UDC 519.856; 519.854; 519.85 STATISTICAL SEARCH FOR INFORMATION COMPUTING NETWORK STRUCTURES V.V. Malygin The properties of convergence of the function for estimating the structure of an information and computer network have been studied. On

Construction of recursive-parallel algorithms for solving computational geometry problems based on the “distribute and conquer” strategy V.N. Tereshchenko The paper discusses one of the approaches to building effective

12.1. I/O based on device readiness poll The readiness or unreadiness of an external device for I/O is checked in the external device status register For software-controlled I/O

FLYNN'S TAXONOMY Julia Kirillova 6057/2 11/22/11 Flynn's taxonomy is a general classification of computer architectures based on the presence of parallelism in command and data flows. proposed in 1972 by Michael Flynn.

Laboratory work 4 Solving the Poisson problem by the Jacobi method in a three-dimensional domain The goal is the practical development of methods for parallelizing algorithms for problems solved by grid methods using the example of a solution

Plaksin M.A.

National Research University Higher School of Economics (Perm branch), Perm, Ph.D., Associate Professor of the Department of Information Technologies in Business, mapl @ list. ru

"SUPERCOMPUTERS" VS "PARALLEL PROGRAMMING". “PARALLEL PROGRAMMING” VS “COLLABORATIVE ACTIVITY”. HOW TO STUDY THE TOPIC “PARALLEL COMPUTING” IN SECONDARY SCHOOL?

KEYWORDS

Computer science, parallel programming, parallel computing, parallel algorithms, supercomputers, primary school, secondary school, TRIZformashka.

ANNOTATION

The article is devoted to the issue of including the topic “parallel computing” in the school computer science course. A number of problems arising in this case are mentioned, the purpose of studying the topic, the selection of material, some proposals for teaching methods, mechanisms for testing the proposed methodology and the accumulated experience are considered. The question of the place of this material in the curriculum is not addressed.

The current stage of development of computer science is associated with the massive spread of parallelism of calculations at all levels (multi-machine clusters, multiprocessor computers, multi-core processors).

The massive spread of parallelism has serious consequences that have yet to be identified and analyzed. Let's start by listing some theoretical problems.

The modern theory of algorithms was created with the concept of a sequential algorithm in mind. How will the refusal to require a sequence of steps affect the concept of an algorithm?

For at least the last 20 years, the concept of “algorithm” has been introduced in schools inextricably linked with the concept of “performer”. This is natural for a sequential algorithm. What to do with a parallel algorithm? Is it performed by one performer or a group of performers? To be specific, let’s take the computer training program “Tank Crew” as an example. In this program, the student is required to program the actions of a tank crew consisting of three people: a gunner, a driver and a loader. Each of them has its own command system. In order to complete a combat mission (hit all targets), all crew members must act in concert. For an example of the playing field for the Tank Crew program, see Fig. 1.

Question: should these three actors be considered as independent performers or as three components (devices) of one complex performer? For the tank crew, the second option seems more natural, since not a single character is able to complete the task on his own. But what if the game becomes more complicated, and a combat mission is assigned to two tanks at once? For three tanks? Three members of one crew can well be considered as three parts of one performer. But each crew is obviously an independent performer. This means that a parallel algorithm for several tanks will be executed by a group of executors at once. It turns out that for a parallel algorithm it is necessary to consider both possibilities: the execution of parallel actions by one executor and by a group of executors. In the case of a tank crew, drawing the line is easy. The performer is the one who is able to solve the task. This executor may consist of several components, each of which performs a certain part of the task, but cannot independently complete the entire task without the help of other components. But whether the separation of “whole performers” and parts of a complex performer will always be as simple is impossible to say now.

File 1*ra Window About the program

Vpolyet everything

Bbno.n«fTb to the highlighted line

Return to initial page**"

would popnlt step by step (after executing the “.order command nesykoa^“ the button gV ygolg “n-b next uwr” will be pressed)

Ё ГГВД iTHWTt. special step

Information step by step

Fig.1. Fragment of the playing field of the Tank Crew program

Identification of the parts of the performer that are capable of independent actions requires that these parts be named somehow. Moreover, the name must allow recursion, since the acting parts of the performer themselves may have a complex structure.

It is necessary to agree on a term to designate a group of cooperative performers. The term “command” is not suitable; it is associated with the “executor command system” and with “central processor commands”. "Collective of performers"? "Brigade of performers"?

Sh. Algorithm

n Hitting1"; Driver Charger

1 Measure orun* on “master sgkll V Stop V Charge 1

g Pci V Stop V Charge 2

3 Wholesale! V Turn clockwise 90 degrees V Charge 1 V

L V V first V Charge? V

5 Fire! V Stop V Charge 1

Í P^chm V St*p V Zaryasya? V

7 Fire! V Stop V Charge 1 V

3 Pa^ V Turn clockwise 45 degrees V Charge 2 V

S Pause V Start V Pause V

10 Pvdea V Forward V Pause ¿d

11 Plrl V Forward V Pause V

12 Paum V Turn clockwise 45 degrees V Pause V

13 Padm V Forward V Pause V

14 V

Fig.2. Fragment of the program for the “Tank Crew” (example of command lines) The traditional concept of the “executor command system” (SCS) and the concept of the team itself require improvement. If we believe that three members of a tank crew form a single performer, then what should be considered the SKI of this performer? And what is considered a team? Or leave the concept of SKI for each character? That is, this is no longer a system of commands of the EXECUTOR, but a system of commands of one of the components of the performer (for which there is no name yet)?

It is convenient to expand the concept of a team to a “line of commands.” For an example of tank crew command lines, see Fig. 2. However, the concept of a “line of commands” only works well for linear algorithms. In other cases, the rulers are formed dynamically. It is impossible to depict them in the form of a visual table.

Among the properties of algorithms, a new practically significant characteristic stands out: the ability to parallelize. A clarifying question is about the possible degree of parallelization (to what extent does it make sense to increase the number of processors when executing a given algorithm).

A separate issue is methods for parallelizing existing sequential algorithms.

Until recently, parallel programming was the province of a small number of highly skilled systems programmers. Today it is becoming part of professional competence. But parallel programming technology differs significantly from traditional sequential programming. In support of this statement, following L.L. Bosova, we will quote the largest Russian specialist in the field of parallel computing V.V. Vojvodina:

“... The mastery of parallel architecture computing technology... by young specialists comes with great difficulties. In our opinion, this is due to the fact that acquaintance with parallel computing, as well as education in this area in general, does not begin with where one should begin. In addition, what you need to start with is not covered in any courses at all. The ability to quickly solve problems using parallel architecture computing technology forces users to change their entire habitual style of interaction with computers. Compared, for example, with personal computers and workstations, almost everything changes: other programming languages are used, most algorithms are modified, users are required to provide numerous non-standard and difficult-to-obtain characteristics of the tasks being solved, the interface ceases to be friendly, etc. An important fact is that failure to fully take into account new operating conditions can significantly reduce the efficiency of using new and, moreover, quite expensive equipment.”

“It is only important that the student learns as early as possible that there are other ways of organizing computational processes, and not just the sequential execution of “operation by operation,” that the most powerful modern computer technology is built on these other methods, that only with such technology it is possible to solve large problems. industrial and scientific tasks, etc. It is important, first of all, to draw students’ attention as early as possible to the need for a critical attitude towards the philosophy of sequential computing. After all, it is this philosophy that they have to deal with throughout their entire education, both at school and at university. And it is precisely this philosophy that hinders the understanding of the features of working on parallel architecture computers.”

Today we need methods for mass training in parallel programming technology. The author of this article believes that in the learning process the time has come for a revolution in the relationship between sequential and parallel programming. Until now, we first taught sequential programming, and then - parallelization of sequential algorithms. Now we need to raise the question of teaching parallel programming right away. A sequential algorithm should be considered as a certain part of a parallel algorithm, which does not require communication with its other parts. How to do this is an open question. There are still some ideas that need practical implementation and testing. It is hoped that in a year at the next conference it will be possible to discuss the results obtained.

Thirty years ago, the beginning of mass computerization of production required an increase in the level of computer literacy of the population. This led to the introduction of computer science into the school curriculum in 1985. But the computer science course in the Soviet (then Russian) version was not limited to “push-button computer science” - to mastering the technology of working with application software packages and computer games. He began to change the thinking style of the younger generation. First of all, this concerned algorithmicity, accuracy, and rigor. Then the computer science course incorporated elements of logic and systems analysis. Subsequently, all this greatly simplified the distribution of much-needed technology in the 21st century. project approach. The talk now is that over the next decade, parallel algorithms should become

element of the general culture of thinking. Question: how will mastering the concept of a parallel algorithm affect the thinking of the next generation, what will the restructuring of consciousness “in a parallel way” lead to?

The massive spread of parallel information processing makes it urgent to move the relevant concepts into the category of publicly accessible and general cultural ones. Familiarity with parallel algorithms should become part of literacy, just as basic concepts of algorithm theory have become over the past quarter century. This can be done only in one way - by including relevant topics in the school computer science course. This means that we need a methodology for initial acquaintance with parallel programming at the high school level.

Historically, the first attempt to include the topic of parallel computing in a school computer science course was made twenty years ago. Twenty years ago, in a course called “Algorithmics”, a “Construction Director” was described, who commanded the parallel actions of several teams building a structure from rectangular and triangular blocks. Moreover, a software implementation was created for this performer. Alas! This wonderful methodological development was not in demand in the mid-90s. She was almost twenty years ahead of her time!

Today, the situation is such that the topic of parallel computing in high school is primarily related to the topic of supercomputers. It is on supercomputers that the authors of various methodological developments focus the attention of students, even when this is not necessary. Suffice it to say that the corresponding section in the journal “Informatics at School” is called “Supercomputer Education at School.” This situation has both positive and negative sides. Among the positive aspects are:

The topic of supercomputers is of interest in society, including among students. This interest repeats at the modern level the interest that half a century ago was caused by large machines - the supercomputers of their time;

Organizational support from the supercomputing community. Every summer, the Faculty of Computational Mathematics and Cybernetics of Moscow State University hosts the Summer Supercomputer Academy. And every summer, within the framework of this Academy, a school track for computer science teachers is organized. Training is provided free of charge. Nonresident students are provided with housing on very favorable terms. At the Russian Supercomputing Days conference in September 2015, a school section and master class for computer science teachers were organized. Consistent organizational work led to the identification and formation of a group of teachers interested in promoting this topic;

The presence of a bright, charismatic leader, such as Vladimir Valentinovich Voevodin - Doctor of Physical and Mathematical Sciences, Professor, Corresponding Member of the Russian Academy of Sciences, Deputy Director of the Research Computing Center of Moscow State University;

Interest and support (including material) from the Russian representative office of Intel and the strategic development manager of Intel, Igor Olegovich Odintsov.

The disadvantage of the “supercomputer” approach is that it narrows the scope of parallel computing. Supercomputers themselves are, as a rule, inaccessible to schoolchildren (unless in large cities they can be seen on excursions). The tasks they are aimed at solving are too complex for schoolchildren and, in most cases, do not have immediate practical significance and are not of practical interest.

A natural extension of the supercomputing field is the study of parallel programming. Currently, to execute parallel programs it is not at all necessary to have a supercomputer. A multi-core processor or video card with a set of graphics accelerators is enough. And this is already available to almost everyone. Among the works in this direction, we note the candidate’s dissertation of M.A. Sokolovskaya on the methodology of teaching future computer science teachers the basics of parallel programming and the experience of E.Yu. Kiseleva on mastering CUDA technology by schoolchildren.

According to the author of this article, focusing on supercomputers and parallel programming significantly impoverishes and complicates the topic of parallel computing and distracts students from many important and accessible issues. The purpose of the topic “parallel

computing” in high school is not teaching “real” parallel programming (studying relevant language constructs, programming languages and technologies), but familiarizing students with the corresponding set of concepts and understanding the features of parallel work. The world around and inside us is a complex parallel system. And this system itself provides a lot of material for mastering the concepts and mechanisms of parallelism. No complex artificial structures such as MPI and OpenMP technologies are needed for this. School computer science should foster thinking in a “parallel way.” And then let the university incorporate professional knowledge, skills, and abilities into this thinking. At school, it makes sense to focus not on getting to know supercomputers and studying parallel programming, but on mastering the mechanisms of “joint activity” that are constantly and widely used in life. The course proposes to cover the following questions:

1) Collaboration of several executors (digging a ditch with several diggers) and parallelization “inside” one executor in the presence of several processing devices (reading and eating an apple). In computer science, this will be a multi-machine complex and a multi-core processor.

2) Types of parallelism: true parallelism and pseudo-parallelism (one processor executes several programs in parts).

3) Performers of the same type (diggers) and different types (tank crew).

4) Works of the same type and different types.

5) The ratio of “executors - jobs”: 1 performer - 1 job, 1 performer - N jobs (pseudo-parallel execution or true parallelism in the presence of several processing devices for different jobs), N performers - 1 job, N performers - N jobs.

6) Coordination of the activities of performers. Types of approval: by parts of work, by time, by results of activities, by resources.

7) Resources. Resources are shared and non-shared, consumable and reusable. Recycling of consumed resources (“garbage collection” in the broad sense).

8) Performing the same work by one performer and a group of performers. Dependence of work speed on the number of performers. Dependence of the cost of work on the number of performers. Nonlinear increase in work speed with an increase in the number of performers. Critical path. Optimal number of performers. Optimal loading of performers. Optimal procedure. Load balancing.

9) Competition between performers for resources. Blocking. Clinch (deadlock).

10) Mechanisms for coordinating the actions of performers.

11) Pseudo-parallel execution of processes on a computer (division of one resource - the processor) between executor-processes.

12) Suitability of algorithms for parallelization. Possible degree of parallelization. The existence of algorithms that cannot be parallelized.

Please note that the above list represents the private opinion of the author of the article and is open for discussion, addition and correction. Moreover, in the author’s opinion, it would be very useful for the “supercomputer community” to formulate a “social order” for the school: what kind of knowledge and skills it wants to see in school graduates. How should a graduate of the “supercomputer world” school differ from a graduate of today? If there is an order, there will be a result. Fresh example. On the first day of Russian Supercomputing Days-2015, two reports voiced the idea that the speed of modern supercomputers is determined not by the power of processors (which is the focus of public attention), but by the speed of RAM. It is this that becomes the bottleneck, the throughput of which determines the productivity of the entire system. As a result, on the second day of the conference, participants in the teacher’s master class tested a game invented by the author of this article, demonstrating the interaction of the central processor, RAM and cache memory. The order and form of presentation of the material is an open question.

The material should be demonstrated using examples not related to the operation of a computer. Performers must manipulate material objects.

As much of the training as possible should be in the nature of business (organizational and activity) games.

Fulfilling these requirements will make it easier to understand the material being studied. This will be useful both when using this technique in computer science lessons at school (including elementary!), and when teaching adults: computer science teachers and students. A schoolchild, a school teacher, a student of a non-core specialty will be able to stop at the level of familiarization and understanding. The professional student will have to take the next step and move from acquaintance to studying these mechanisms at a professional level. But this is already a step beyond the methods of initial familiarization with the topic.

The author of this article began work on preparing a methodology for studying parallel computing in 2013 during the preparation of the TRIZformashka-2013 competition and continued in subsequent years.

(“TRIZformashka” is an interregional Internet competition in computer science, system analysis and TRIZ. Held annually in the second half of March. The age of participants is from the first grade to the fourth year. Geography - from Vladivostok to Riga. The average number of participants is 100 teams (300 people .), maximum - 202 teams (more than 600 people). Competition website www.trizformashka.ru.) Then, in 2013, the goal of the work was formulated as follows:

1. Within two to three years, prepare a description of performers, a set of games and tasks related to parallel computing;

2. Offer them (in parts, annually) to the participants of the competition;

3. Analyze their reaction (assess the number of solvers, their age, the success of the solution, typical errors, detected inaccuracies in the formulation of problems, etc.). The TRIZformashka competition turned out to be a convenient tool for debugging problems, since

made it possible to obtain reactions from all ages (from first grade to fourth year), from different regions, from different educational institutions.

Over the past years, the following set of methodological tools and platforms for their testing have been prepared.

1. Tasks on parallelism, starting from 2013, were included in the “TRIZformashka” competition (starting from 2013, the competition has the subtitle “Parallel Computing”). A list of task types is given below;

2. A chapter on parallelism has been prepared for the new version of the computer science textbook for grade 4. The material was tested in 3rd and 4th grades of Lyceum No. 10 in Perm;

3. The computer game “Tank Crew” has been developed and used since 2014 in the TRIZformashka competition;

4. A number of games have been developed and tested, which reflect the following issues:

Coordination of the activities of performers. Various types of approval;

Performing the same work by one performer and a group of performers. Dependence of work speed on the number of performers. Nonlinear increase in work speed with an increase in the number of performers. Critical path. Optimal number of performers. Optimal loading of performers. Optimal procedure;

Resources. Shared and non-shared resources;

Competition between performers for resources. Blocking. Clinch (deadlock). The following types of problems were proposed and tested:

1. Tasks on types of approval. (What types of coordination exist in the school cafeteria?);

2. Game "Tank Crew". Assignment to construct a parallel algorithm;

3. Performer “Construction”. At the same time, working teams build a structure from horizontal and vertical beams. Tasks include tasks for the execution of the specified algorithm, for the development of a new algorithm, for searching for errors in a given algorithm, for researching algorithms (comparing construction times using different algorithms, comparing construction costs, assessing the possibility of saving by redistributing labor, etc.);

4. Competition for resources. Three little pigs each cook their own lunch. For each piglet it is indicated what dishes he prepares, what resources (equipment, utensils, etc.) he needs for this and for how long these resources should be used. It is necessary to draw up a work schedule for each piglet, if he cooks alone in the kitchen, if they cook in pairs, if all three cook at once. Cooking time should be minimized;

5. Network diagram. A network diagram is given. It is required to depict (schematically) the structure that will be built, determine how many days it will take for construction with a particular number of teams, what part of the work will be completed by a certain time;

6. Tiered-parallel forms. Planning work according to various criteria. The work assignment, worker productivity, and payment rules are given. It is necessary to determine the number of workers needed to complete the work in a given time, determine the period of work for a given number of workers, determine the number of workers needed to minimize the cost of work;

7. Gantt charts. The text describes the work plan for the reconstruction of the workshop: duration and mutual sequence of actions, required workers. It is necessary to determine the deadline for the completion of the object, the change in the deadline due to certain changes in the workforce, and the list of workers involved on a specific date.

8. Coordination of repetitive work. Let the task be given to produce a batch of devices in a minimum time, provided that each device must be processed on different equipment; there are different amounts of equipment with different productivity. It is necessary to plan the start and operation time of each equipment and minimize downtime.

Today we have the following results:

1. An approach has been formulated for studying the topic of “parallel computing”: to go not from the problems of computer science, but “from life”, to focus on “joint activity”;

2. A list of issues has been formulated that are proposed to be reflected in the initial course of parallel computing;

3. Some classes of problems are formulated. Based on the accumulated experience, you can evaluate what kind of problems should be invented;

4. A set of problems of the named classes has been prepared. The problems were tested in the TRIZformashka competitions in 2013, 2014, 2015. and/or in elementary school (in classes with students of the third and fourth grades of Lyceum No. 10 in Perm);

5. A set of business games has been prepared. The games have been tested in primary schools and at a number of events for teachers. In particular, they were presented at the school track of the Summer Supercomputer Academy of Moscow State University in 2014, at a master class for teachers at Russian Supercomputing Days-2015, at several other conferences (including at the IT-education-2015 conference of the APKIT association) and other events for computer science teachers;

6. A set of texts about parallelism has been prepared for a fourth grade textbook. The texts were tested at Lyceum No. 10 in Perm;

7. The computer game “Tank Crew” has been prepared. The game was tested in the TRIZformashka competitions in 2014 and 2015;

8. The TRIZformashka competition has proven itself as a testing platform;

9. The task of “castling” in the process of teaching algorithmization is formulated: immediately teach parallel programming, presenting a sequential algorithm as part of a parallel one. There are thoughts on how this idea can be implemented. There is an opportunity to try out these ideas during the current school year (for students in grades 4-5);

10. There is a need, desire and opportunity to continue working.

Literature

1. Algorithms: grades 5-7: Textbook and problem book for general education. educational institutions /A.K. Zvonkin, A.G. Kulakov, S.K. Lando, A.L. Semenov, A.Kh. Shen. - M.: Bustard, 1996.

2. Bosova L.L. Parallel algorithms in primary and secondary schools. //Informatics at school. 2015, no. 2. P.24-27.

3. Voevodin V.V. Computational mathematics and the structure of algorithms: Lecture 10 about why it is difficult to solve problems on parallel architecture computing systems and what additional information you need to know. to successfully overcome these difficulties: a textbook. M.: MSU Publishing House 2010.

4. Gavrilova I.V. First trip to the “parallel world”. //Informatics at school. 2015, No. 6. P.16-19.

5. Dieter M.L., Plaksin M.A. Parallel computing in school computer science. Game "Construction". //Informatics at school: past, present and future.: materials of the All-Russian. scientific method. conf. on the use of ICT in education, February 6-7, 2014 /Perm. state national research univ. - Perm, 2014. - P.258-261.

6. Ivanova N.G., Plaksin M.A., Rusakova O.L. TRIZformashka. //Computer science. N05 Retrieved 10/10/2015.

14. Plaksin M.A. Computer science: textbook for 4th grade: 2 hours / M.A.Plaksin, N.G.Ivanova, O.L.Rusakova. - M.: BINOM. Knowledge Laboratory, 2012.

15. Plaksin M.A. About the method of initial acquaintance with parallel computing in high school. //Informatics at school: past, present and future.: materials of the All-Russian. scientific method. conf. on the use of ICT in education, February 6-7, 2014 /Perm. state national research univ. - Perm, 2014. - P.256-258.

16. Plaksin M.A. A set of business games for introducing parallel computing in elementary school. //Teaching information technologies in the Russian Federation: materials of the Thirteenth open All-Russian conference “IT-0education-2015” (Perm, May 14-15, 2015). Perm State National Research University, - Perm, 2015. P.60-62.

17. Plaksin M.A., Ivanova N.G., Rusakova O.L. A set of tasks for getting acquainted with parallel computing in the TRIZformashka competition. //Teaching information technologies in the Russian Federation: materials of the Thirteenth Open All-Russian Conference “IT Education-2015” (Perm, May 14-15, 2015). Perm State National Research University, - Perm, 2015. pp. 232-234.

18. Sokolovskaya M.A. Methodological system for teaching the basics of parallel programming to future computer science teachers.: abstract. dis. ...cand. ped. Sciences, Krasnoyarsk, 2012.

Annotation: What makes education change, parallel computing at the intersection of disciplines, sequential computing masks developmental problems, the need to teach how to solve problems effectively, the cause of many difficulties is ignorance of the structure of algorithms, possible ways to change the situation

It is known that mastering parallel architecture computing technology, especially by young specialists, comes with great difficulties. In our opinion, this is due to the fact that acquaintance with parallel computing, as well as education in this area in general, does not begin with where one should begin. In addition, what you need to start with is not covered in any courses at all.

The ability to quickly solve problems using parallel architecture computing technology forces users to change their entire habitual style of interaction with computers. Compared, for example, with personal computers and workstations, almost everything changes: other programming languages are used, most algorithms are modified, users are required to provide numerous non-standard and difficult-to-obtain characteristics of the tasks being solved, the interface ceases to be friendly, etc. An important fact is that failure to fully take into account new operating conditions can significantly reduce the efficiency of using new and, moreover, quite expensive equipment.

It should be noted that the general nature of the difficulties accompanying the development of parallel computing looks generally the same as it was in the days of sequential computing. Only for parallel computing do all the difficulties appear in a more acute form. Largely due to the greater complexity of the subject area. But, perhaps, mainly due to the fact that by the beginning of the active introduction of parallel architecture computing systems into the practice of solving large applied problems, the necessary theoretical foundation had not been built and the mathematical apparatus of research had not been developed. In the end, because of this, the entire educational cycle in the field of parallel computing turned out to be unprepared in a timely manner, the echoes of which are still evident today. Hence the lack of understanding of the numerous difficulties in mastering modern computer technology, gaps in the training of the necessary specialists, and much more.

Education in the field of parallel computing is based on three disciplines: computer systems architecture, programming and computational mathematics. If you carefully analyze the content of the relevant courses, you will inevitably come to the conclusion that not only individually, but even all together, they currently do not ensure the achievement of the main custom goals - to learn effectively solve large problems on large computing systems of parallel architecture. Of course, these courses provide a lot of useful and necessary information. However, much that needs to be known according to the modern view of parallel computing is not given in them. This, in particular, is due to the fact that a number of the most important and even fundamental facts, methods and technologies for solving large problems on large systems arose as a result of research at the intersection of several subject areas. Such results do not fit into the framework of traditional disciplines. Therefore, as a consequence, the information presented in the relevant courses turns out to be insufficient for the formation whole system knowledge focused on the competent construction of parallel computing processes.

All educational courses that are in one way or another related to computer technology or its use can be divided into two groups. The first group contains basic information, the second - special information. Basic information is universal in nature and is poorly classified by type of computer technology. They were formed on the basis of knowledge about sequential machines and sequential calculations and change little over time. As part of the programming course, basic information begins to be read from the first or second semester, as part of the numerical methods course from approximately the third semester. Special courses, including those related to parallel architecture computing systems, begin to be taught quite late. As a rule, not earlier than the seventh or even ninth semester.

At first glance, everything looks logical: first, basic information is given, then special information. However, in practice, the division of information into basic and special turns out to be very conditional, since only the following is important: is it possible to obtain the necessary information at the right time or is there no such opportunity and what is the set of information offered for study.

The development of computational mathematics has a long history. But its most rapid development is associated with electronic computers. These machines arose as a tool for carrying out consecutive calculations. While developing rapidly, they essentially remained consistent for several decades. For sequential machines, they began to create quite early machine independent programming languages. For mathematicians and application software developers, the emergence of such languages offered an attractive prospect. There was no need to delve into the design of computers, since programming languages essentially differed little from the language of mathematical descriptions. The speed of implementation of algorithms on sequential machines was determined mainly by the number of operations performed and was almost independent of how the algorithms themselves were internally structured. Therefore, in the development of algorithms, the main objective functions of their quality became obvious - minimizing the number of operations performed and resistance to the influence of rounding errors. No other information about algorithms was simply needed to effectively solve problems using sequential technology.

All this for many years determined the main direction of development not only of numerical methods, but also of all computational mathematics. Against the background of insufficient attention to the development of computer technology, mathematicians did not notice an important circumstance in time: quantitative changes in technology are already turning into such qualitative ones that communication with it using sequential languages should soon become impossible. This has led to a serious gap between existing knowledge of algorithms and the knowledge needed to quickly solve problems on the latest computing technology. The resulting gap underlies many of the difficulties in the practical development of modern parallel architecture computing systems.

Now the computing world, at least the world of big computing, has changed radically. It became parallel. On parallel architecture computing systems, the time it takes to solve problems fundamentally depends on what the internal structure of the algorithm is and in what order its operations are performed. The possibility of accelerated implementation on parallel systems is achieved due to the fact that they have a sufficiently large number of functional devices that can simultaneously perform some operations of the algorithm. But to use this opportunity, it is necessary to obtain new information regarding the structure of the algorithm at the level of connections between individual operations. Moreover, this information must be consistent with information about the architecture of the computing system.

Almost nothing is said in educational courses about the joint analysis of system architecture and algorithm structure. While architectures of computer systems and parallel programming are taught at least in special courses, discussion of the structures of algorithms at the level of individual operations is currently not included in any educational disciplines. And this despite the fact that the structures of algorithms have been discussed in the scientific literature for several decades, and the practice of using parallel architecture computing technology has not been for a much shorter period. Naturally, the question arose about what to do next. The answer to this has already been given before, but it is useful to repeat it.

Until now, computational mathematicians have been taught how to solve problems mathematically correctly. Now we also need to teach how to solve problems effectively using modern computer technology.

What information in the field of algorithm structure you need to know additionally was discussed in the given lectures. Based on this material, various programs for modernizing educational courses can be developed. in the interests of parallel computing. The most effective modernization involves agreed upon changes to several courses. One of the programs, designed to train highly qualified specialists to solve large problems on large systems, might look like this:

reading three or four lectures “Introduction to Parallel Computing” in the first courses;
introduction to basic cycles in mathematics and programming of basic information about parallel computing;
a significant restructuring of the series of lectures on numerical methods with a mandatory description of the information structure of each algorithm;
organizing a workshop on parallel computing;
reading a special course " Parallel structure algorithms";
reading a special course "Parallel Computing".

This program is in a certain sense the maximum. However, it is very real. Of course, it cannot be fully implemented in every university. But on its basis, each specific university can formulate its own educational program in the field of computer science.

From the first two points, either one of them, or both at once, can be introduced into the educational cycle. It is only important that the student as soon as possible I learned that there are other ways to organize computing processes, and not just sequential execution “operation by operation”, that the most powerful modern computer technology is built on these other methods, that only with such technology it is possible to solve large industrial and scientific problems, etc. It is important, first of all, to draw students’ attention as early as possible to the need for a critical attitude towards the philosophy of sequential computing. After all, it is this philosophy that they have to deal with throughout their entire education, both at school and at university. And it is precisely this philosophy that hinders the understanding of the features of working on parallel architecture computers.

It is quite appropriate to include basic information about parallel computing in a programming course. In it you can discuss the simplest model parallel computing system, talk about parallel processes and their characteristics. Here it is useful to introduce an abstract form of description of computational algorithms. Moreover, it is not at all necessary to provide specific contents. It’s better to talk about this later when studying numerical methods. We can start talking about parallel forms of algorithms and their uses. All information about parallel computing, in our opinion, can be presented in a programming course in two or three lectures. A good testing ground for demonstrating parallelism in algorithms is a linear algebra course. Matrix operations and the Gaussian method for solving systems of linear algebraic equations appear quite early in it. Using the appropriate algorithms, even “on your fingers,” you can demonstrate parallelism of calculations, fast algorithms, and much more. Discussion of new information will require no more than one lecture in total.

You should not overload your first introduction to parallel computing with a lot of details and serious results. The main goal of this stage is only to arouse interest in this topic. It is enough to give a general idea of computational parallelism, parallel forms, algorithm graphs and characteristics of computational processes. If all the initial information is combined into a single series “Introduction to Parallel Computing,” then it can be covered in three or four lectures. But let us emphasize once again that it is necessary to bring this information to students as early as possible.

The materials in these lectures convincingly demonstrate how important a good knowledge of algorithm graphs and their parallel forms is for understanding the problems that one has to face when solving problems on modern parallel architecture computing systems. It is most natural to include information about this in courses on computational mathematics. The main argument in favor of such a solution is related to the fact that the information structure of the algorithms is described in the same index systems in which the presentation and numerical methods take place. By and large, all that needs to be added to the existing course on numerical methods is information about algorithm graphs, sets of scans for them, and the technology for using all this. Of course, preparing updated courses requires some work. But it is not at all necessary to discuss the structures of algorithms in full. It is enough to do this only for their computing cores. The concept of algorithm graphs and technology should most likely be presented at the very beginning of the course. However, it is advisable to provide a graph and developments for each algorithm. In addition to the above, we note that the same numerical methods are read without change for many years, and information about their structures needs to be prepared only once. And this information will certainly be used many times, and in a variety of areas.

One of the most difficult technically and least developed from a methodological point of view is the issue of organizing a workshop on parallel computing. Of course, to carry it out you need to have parallel computing technology. But in many universities this technology has been available for a long time, and there is still no final opinion on what the workshop should be like.

One of the obvious goals of the workshop lies on the surface. Parallel architecture computing systems are created to solve large problems. Therefore, during the workshop you need to at least to some extent learn how to solve such problems. Everything seems to be clear. A general programming course or some special courses provide knowledge on specific languages or parallel programming systems. During the workshop, specific tasks are given. Programs are compiled, run on a computer system, and the results are compared with a standard. However, even such a simple scheme has bottlenecks. In fact, what is considered a result? When solving large problems on large systems, the main difficulty is not so much obtaining mathematically correct result how much achievement desired level of acceleration. This means that during the practical training, you also need to learn how to correctly evaluate the effectiveness of compiled programs.

If a particular program does not show the required performance characteristics, then the question arises about further actions. In this situation, you almost always have to begin a more detailed study of the structure of the algorithms. Perhaps it is the study of the structure of algorithms that should become key element of the workshop. It would be a good help in introducing the structure of algorithms in modernized courses on numerical methods. But there are also stronger arguments for becoming more familiar with the structure of algorithms.

One argument has to do with current issues. Recently, various multiprocessor systems with distributed memory have become widely used in computing practice. These include not only clusters, but also heterogeneous networks of computers, networks of computers connected via the Internet, etc. In all such systems, the bottleneck is information exchange between processors. To operate efficiently, it is necessary that each processor perform quite a lot of operations and exchange relatively small pieces of data with the memory of other processors. We have already noted in lectures that to ensure such a calculation mode, it is enough to know the algorithm graph and at least two independent developments. In other words, you need to know the structure of the algorithms.

Another argument is related to the possible prospects for the development of computer technology. The speed of solving large problems has to be increased today and will certainly have to be increased in the future. As a rule, the main hopes are associated with the creation of faster speeds based on various technological achievements. universal systems But it is possible to increase the speed of computer technology due to its specializations. The use of special processors that carry out very fast implementation of fast Fourier transform algorithms, signal processing, matrix operations, etc. has long been practiced. Now let’s remember the hypothesis about typical structures. If it is correct, then in specific application areas it will be possible to identify the most frequently used algorithms and build special processors for them. This opens up the way to create specialized computing systems for fast and ultra-fast solution of problems from a given subject area.

The main difficulty in introducing tasks related to the study of the structure of algorithms into the workshop is the current lack of accessible and easy-to-use software for constructing algorithm graphs and conducting various studies based on them. Essentially, there is only one system that implements such functions. This is a V-Ray system developed at the Research Computing Center of Moscow State University. It makes it possible for various classes of programs to build graphs of algorithms and study their parallel structure. The V-Ray system is implemented on a personal computer and is independent of the target computer. The last circumstance is extremely important for organizing a workshop, since frequent access to large computer systems with small tasks is not very realistic even for universities with good technical equipment. On personal computers, the time for mastering the tasks of the workshop is practically unlimited. Currently, V-Ray represents a complex research system. Not all of its functions are needed to organize a workshop. Over time, the V-Ray system will become available for widespread use. Information about it and its capabilities can be obtained

There are various ways to implement parallel computing. For example, each computing process may be implemented as an operating system process, or computing processes may be a collection of threads of execution within a single OS process. Parallel programs can be physically executed either sequentially on a single processor - alternating in turn the steps of each computational process, or in parallel - allocating one or more processors (located nearby or distributed in a computer network) to each computational process.

The main difficulty in designing parallel programs is to ensure the correct sequence of interactions between different computing processes, as well as the coordination of resources shared between processes.

Methods for synchronizing parallel interaction

In some parallel programming systems, data transfer between components is hidden from the programmer (for example, using a promise mechanism), while in others it must be specified explicitly. Explicit interactions can be divided into two types:

Message-passing parallel systems are often easier to understand than shared memory systems and are generally considered a superior method of parallel programming. There is a wide range of mathematical theories for the study and analysis of message passing systems, including the actor model and various types of process calculus. Messaging can be efficiently implemented on symmetric multiprocessors with or without shared coherent memory.

Distributed memory parallelism and message passing parallelism have different performance characteristics. Typically (but not always), the overhead of process memory and task switching time is lower for message-passing systems, but message passing is more expensive than procedure calls. These differences are often overshadowed by other factors that affect performance.

Of course, in such a system you can also use the message passing method exclusively, that is, run a separate process on each processor of each node. In this case, the number of processes (and threads) will be equal to the number of processors on all nodes. This method is simpler (in a parallel program you only need to increase the number of processes), but is less efficient, since the processors of the same node will exchange messages with each other as if they were on different machines.

Typical tasks that allow parallel computing

map - execution of the same function on each element of the input data array, obtaining an array of calculation results equal in size
reduce - execution of the same function to add the contribution of each element of the input data to one final value

Software parallelism tools

OpenMP is an application interface standard for shared memory parallel systems.
POSIX Threads is a standard for implementing execution threads.
Windows API - multi-threaded applications for C++.
PVM (Parallel Virtual Machine) allows you to combine a heterogeneous (but networked) set of computers into a common computing resource.
MPI (Message Passing Interface) is a standard for message passing systems between parallel executing processes.

Write a review about the article "Parallel Computing"

Literature

Dictionary of Cybernetics / Edited by Academician V. S. Mikhalevich. - 2nd. - Kyiv: Main editorial office of the Ukrainian Soviet Encyclopedia named after M. P. Bazhan, 1989. - 751 p. - (C48). - 50,000 copies. - ISBN 5-88500-008-5.
. - IBM RedBook, 1999. - 238 p.(English)
Voevodin V.V., Voevodin Vl. IN. Parallel computing. - St. Petersburg: BHV-Petersburg, 2002. - 608 p. - ISBN 5-94157-160-7.
Olenev N. N.. - M.: Computer Center RAS, 2005. - 80 p. - ISBN 5201098320.

Notes

Links

(English)
(English)

Parallel Computing

General provisions

Levels of parallelism

Flow of execution

Theory

Elements

Interaction

Programming

Computer technology

API

Problems

An excerpt characterizing Parallel Computing

The spirit of the army is a multiplier for mass, giving the product of force. To determine and express the value of the spirit of the army, this unknown factor, is the task of science.
This task is possible only when we stop arbitrarily substituting instead of the value of the entire unknown X those conditions under which force is manifested, such as: orders of the commander, weapons, etc., taking them as the value of the multiplier, and recognize this unknown in all its integrity, that is, as a greater or lesser desire to fight and expose oneself to danger. Then only by expressing known historical facts in equations and by comparing the relative value of this unknown can we hope to determine the unknown itself.
Ten people, battalions or divisions, fighting with fifteen people, battalions or divisions, defeated fifteen, that is, they killed and captured everyone without a trace and themselves lost four; therefore, four were destroyed on one side and fifteen on the other. Therefore four was equal to fifteen, and therefore 4a:=15y. Therefore, w: g/==15:4. This equation does not give the value of the unknown, but it does give the relationship between two unknowns. And by subsuming various historical units (battles, campaigns, periods of war) under such equations, we obtain series of numbers in which laws must exist and can be discovered.
The tactical rule that one must act in masses when advancing and separately when retreating unconsciously confirms only the truth that the strength of an army depends on its spirit. In order to lead people under the cannonballs, more discipline is needed, which can only be achieved by moving in masses, than in order to fight off attackers. But this rule, which loses sight of the spirit of the army, constantly turns out to be incorrect and is especially strikingly contrary to reality where there is a strong rise or decline in the spirit of the army - in all people's wars.
The French, retreating in 1812, although they should have defended themselves separately, according to tactics, huddled together, because the spirit of the army had fallen so low that only the mass held the army together. The Russians, on the contrary, according to tactics, should attack en masse, but in reality they are fragmented, because the spirit is so high that individuals strike without the orders of the French and do not need coercion in order to expose themselves to labor and danger.

The so-called partisan war began with the enemy’s entry into Smolensk.
Before guerrilla warfare was officially accepted by our government, thousands of people of the enemy army - backward marauders, foragers - were exterminated by the Cossacks and peasants, who beat these people as unconsciously as dogs unconsciously kill a runaway rabid dog. Denis Davydov, with his Russian instinct, was the first to understand the meaning of that terrible club, which, without asking the rules of military art, destroyed the French, and he is credited with taking the first step to legitimize this method of war.
On August 24, Davydov’s first partisan detachment was established, and after his detachment others began to be established. The further the campaign progressed, the more the number of these detachments increased.
The partisans destroyed the Great Army piece by piece. They picked up those fallen leaves that fell of their own accord from the withered tree - the French army, and sometimes shook this tree. In October, while the French were fleeing to Smolensk, there were hundreds of these parties of various sizes and characters. There were parties that adopted all the techniques of the army, with infantry, artillery, headquarters, and the comforts of life; there were only Cossacks and cavalry; there were small ones, prefabricated ones, on foot and on horseback, there were peasant and landowner ones, unknown to anyone. There was a sexton as the head of the party, who took several hundred prisoners a month. There was the elder Vasilisa, who killed hundreds of French.
The last days of October were the height of the partisan war. That first period of this war, during which the partisans, themselves surprised at their audacity, were afraid at every moment of being caught and surrounded by the French and, without unsaddled or almost getting off their horses, hid in the forests, expecting a pursuit at every moment, has already passed. Now this war had already been defined, it became clear to everyone what could be done with the French and what could not be done. Now only those detachment commanders who, with their headquarters, according to the rules, walked away from the French, considered many things impossible. The small partisans, who had long since begun their work and were closely looking out for the French, considered it possible what the leaders of large detachments did not dare to think about. The Cossacks and men who climbed among the French believed that now everything was possible.
On October 22, Denisov, who was one of the partisans, was with his party in the midst of partisan passion. In the morning he and his party were on the move. All day long, through the forests adjacent to the high road, he followed a large French transport of cavalry equipment and Russian prisoners, separated from other troops and under strong cover, as was known from spies and prisoners, heading towards Smolensk. This transport was known not only to Denisov and Dolokhov (also a partisan with a small party), who walked close to Denisov, but also to the commanders of large detachments with headquarters: everyone knew about this transport and, as Denisov said, sharpened their teeth on it. Two of these large detachment leaders - one Pole, the other German - almost at the same time sent Denisov an invitation to each join his own detachment in order to attack the transport.
“No, bg”at, I’m with a mustache myself,” said Denisov, having read these papers, and wrote to the German that, despite the spiritual desire that he had to serve under the command of such a valiant and famous general, he must deprive himself of this happiness, because he had already entered under the command of a Pole general. He wrote the same thing to the Pole general, notifying him that he had already entered under the command of a German.
Having ordered this, Denisov intended, without reporting this to the highest commanders, together with Dolokhov, to attack and take this transport with his own small forces. The transport went on October 22 from the village of Mikulina to the village of Shamsheva. On the left side of the road from Mikulin to Shamshev there were large forests, in some places approaching the road itself, in others a mile or more away from the road. Through these forests all day long, now going deeper into the middle of them, now going to the edge, he rode with Denisov’s party, not letting the moving French out of sight. In the morning, not far from Mikulin, where the forest came close to the road, Cossacks from Denisov’s party captured two French wagons with cavalry saddles that had become dirty in the mud and took them into the forest. From then until the evening, the party, without attacking, followed the movement of the French. It was necessary, without frightening them, to let them calmly reach Shamshev and then, uniting with Dolokhov, who was supposed to arrive in the evening for a meeting at the guardhouse in the forest (a mile from Shamshev), at dawn, fall from both sides out of the blue and beat and take everyone at once.
Behind, two miles from Mikulin, where the forest approached the road itself, six Cossacks were left, who were supposed to report as soon as new French columns appeared.
Ahead of Shamsheva, in the same way, Dolokhov had to explore the road in order to know at what distance there were still other French troops. One thousand five hundred people were expected to be transported. Denisov had two hundred people, Dolokhov could have had the same number. But superior numbers did not stop Denisov. The only thing he still needed to know was what exactly these troops were; and for this purpose Denisov needed to take a tongue (that is, a man from the enemy column). In the morning attack on the wagons, the matter was done with such haste that the French who were with the wagons were killed and captured alive only by the drummer boy, who was retarded and could not say anything positive about the kind of troops in the column.
Denisov considered it dangerous to attack another time, so as not to alarm the entire column, and therefore he sent forward to Shamshevo the peasant Tikhon Shcherbaty, who was with his party, to capture, if possible, at least one of the French advanced quarterers who were there.

Parallel and sequential computing. Types of parallel computing

Transcript

Methods for synchronizing parallel interaction

Typical tasks that allow parallel computing

Software parallelism tools

see also

Write a review about the article "Parallel Computing"

Literature

Notes

Links

An excerpt characterizing Parallel Computing

Popular articles

Latest articles

Sections

Pages

Special projects

Contacts