
Lecture 1

Parallel data processing

Plan

1. Tier-parallel form of the algorithm.

2. Automatic detection of parallelism.

3. Degree and levels of parallelism.

4. Types of parallelism.

Parallelism is the ability to perform several arithmetic, logical, or service operations simultaneously. The operations may be either large-block or small-block.

The performance of parallel computing systems depends on many factors, and to a large extent on the architecture and structure of the system (draw the structure of a parallel system and explain):

on the degree and level of parallelism in the system;

on the organization of data transfer between parallel processors;

on the switching (interconnection) system;

on the interaction of processors and memory;

on the relationship between the hardware and software implementation of a macro-operation.

Parallel processing can be based on various principles:

Spatial parallelism;

Temporal parallelism:

  1. Pipelining.
  2. Vectorization.
  3. Matrix processing.
  4. Systolic processing.
  5. Organization of a dataflow processing structure.
  6. Organization of the system based on a hypercube structure.
  7. Dynamic restructuring of the computing system.

The description of any algorithm is hierarchical and based on the nesting property. In programming, the following nesting levels are distinguished: jobs, tasks, subtasks (processes), macro-operations, operations. Nesting determines the depth of parallelization and is one of the important properties of algorithms in the analysis of parallel computing models.

1. Tier-parallel form of the algorithm

The most general form of representation of an algorithm is its information-control graph, which reflects the data dependences between the statements of the algorithm as well as the unconditional and conditional jumps in the program. Such a graph implicitly contains all types of parallelism for the chosen method of solving the problem. A more specific form of representing task parallelism is the apparatus of the tiered-parallel form (TPF).

An algorithm in tiered-parallel form is represented as a set of tiers, with the zero tier containing operators (branches) that are independent of each other.

Transitions can be marked on the graph, meaning the transfer of the result of computing a primitive operation from one tier to an operation in the next tier. Tiers are separated by transitions. There can be "empty" transitions and "empty" primitive operations. An empty operation corresponds to keeping a result obtained in the previous tier. In a sequential chain of operations, an empty operation can be placed in any tier.

When constructing the TPF, one relies on a basic set of primitive operations (BSPO). The tiered-parallel form is characterized by the following parameters (a small illustrative sketch of computing some of them is given after this list):

1. Graph length (number of tiers) L.

2. Width of the i-th tier b_i.

3. Width of the TPF graph B = max(b_i).

4. Average width of the TPF graph B_avg.

5. Fill factor of the i-th tier k_i.

6. Scatter coefficient of operations in the graph Q_{j,i}, j from the BSPO: the number of operations of the j-th type in the i-th tier.

7. The minimum number of processing units (from the BSPO) required to implement the algorithm represented by this TPF graph.

8. The minimum time to execute the algorithm (the sum of the response times of the processing units with the maximum amount of computation in each tier) T_min.

9. Connectivity of the algorithm (the number of intermediate results that must be stored during execution of the algorithm) C.
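As an illustration (not part of the original lecture), here is a minimal C sketch that places each operation of a small hard-coded dependency graph into a tier and then computes the length L, the tier widths b_i, and the width B. The graph and the operation numbering are hypothetical and chosen only so that every operation appears after the operations it depends on.

/* A minimal sketch: tier assignment and basic TPF parameters for a
   small hard-coded dependency graph (hypothetical example data). */
#include <stdio.h>

#define N_OPS  6
#define N_DEPS 5

int main(void) {
    /* dep[k][0] -> dep[k][1]: operation dep[k][1] uses the result of dep[k][0] */
    int dep[N_DEPS][2] = { {0,2}, {1,2}, {2,4}, {3,4}, {4,5} };
    int tier[N_OPS] = {0};

    /* An operation is placed one tier below the deepest operation it depends on. */
    for (int op = 0; op < N_OPS; op++)
        for (int k = 0; k < N_DEPS; k++)
            if (dep[k][1] == op && tier[dep[k][0]] + 1 > tier[op])
                tier[op] = tier[dep[k][0]] + 1;

    /* Length L = number of tiers, width b_i of each tier, B = max b_i. */
    int L = 0;
    for (int op = 0; op < N_OPS; op++)
        if (tier[op] + 1 > L) L = tier[op] + 1;

    int B = 0;
    for (int t = 0; t < L; t++) {
        int b = 0;
        for (int op = 0; op < N_OPS; op++)
            if (tier[op] == t) b++;
        printf("tier %d: width b_%d = %d\n", t, t, b);
        if (b > B) B = b;
    }
    printf("L = %d tiers, B = max b_i = %d, average width = %.2f\n",
           L, B, (double)N_OPS / L);
    return 0;
}

For the graph used in the sketch the result is four tiers with widths 3, 1, 1, 1, i.e. L = 4 and B = 3.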

2. Automatic detection of concurrency

There are two ways to construct a parallel algorithm: directly from the problem statement or by transforming a sequential algorithm.

Methods for constructing a parallel algorithm from a sequential one are based on identifying typical, frequently occurring constructions in the sequential algorithm, which are replaced by parallel ones according to certain rules. (This makes it possible, to some extent, to recover the degree of parallelism lost by the algorithm when it was programmed in a sequential language.)

The nature of the change in the degree of parallelism during the preparation of the computer program is shown in Fig. 2.2.

Fig. 2.2. Change of potential parallelism during program development: 1) parallel programming system; 2) sequential programming and a vectorizing compiler.

Despite the lower level of parallelism achieved when a parallel algorithm is obtained by converting a sequential one, this method is widely used, since it makes it possible to reuse expensive application programs developed and debugged for sequential data-processing systems.

A sequential program may contain both explicit and hidden opportunities for parallel processing.

When the program is analyzed, a data-flow graph is built. To detect explicit parallelism of processes, the sets of input (read) variables R and output (written) variables W of each process are analyzed.

Explicit parallel processing can be detected between processes i and j (i ≠ j) that satisfy the following conditions (a small sketch of this check is given after the list):

the input of one process must not be modified (written) by the other process;

no two processes may modify shared variables:

a) R_i ∩ W_j = ∅;

b) W_i ∩ R_j = ∅;

c) W_i ∩ W_j = ∅.
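A minimal sketch of this check (the two processes and their variable names are hypothetical): each process is described by its read set R and write set W, and the three intersections above are tested for emptiness.

/* A minimal sketch: checking conditions a)-c) for two processes given
   their read sets R and write sets W (hypothetical variable names). */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the two name lists have no common element (empty intersection). */
static int disjoint(const char *a[], int na, const char *b[], int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (strcmp(a[i], b[j]) == 0) return 0;
    return 1;
}

int main(void) {
    /* Process i: x = a + b;   Process j: y = a * c; */
    const char *Ri[] = { "a", "b" }; const char *Wi[] = { "x" };
    const char *Rj[] = { "a", "c" }; const char *Wj[] = { "y" };

    int ok = disjoint(Ri, 2, Wj, 1)   /* a) R_i and W_j do not intersect */
          && disjoint(Wi, 1, Rj, 2)   /* b) W_i and R_j do not intersect */
          && disjoint(Wi, 1, Wj, 1);  /* c) W_i and W_j do not intersect */

    puts(ok ? "the processes can run in parallel"
            : "the processes must run sequentially");
    return 0;
}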

Hidden parallel processing requires some transformation procedure applied to the sequential program in order to make its parallel execution possible. The transformation can be as follows:

a) reducing the height of trees of arithmetic expressions (Fig. 2.3): for arithmetic expressions with n variables or constants, reducing the height of the tree makes it possible to obtain a speedup of order O(n / log2 n) using O(n) processors;

b) transformation of linear recurrent relations;

((a + b) + c) + d  →  (a + b) + (c + d)

Fig. 2.3. Reducing the height of a tree

c) replacement of operators;

d) transformation of blocks of conditional transitions and cycles to the canonical form;

e) distribution of cycles.

Parallel architectures achieve high performance if the parallelizing transformation takes into account the peculiarities of the architecture of the computing system on which the algorithm is to be executed.

When transforming a program for parallelism, the following are taken into account: 1) the layout of data in memory; 2) memory addressing (indexing); 3) the choice of data route (the way processors and memory are connected).

Fig. 2.4. Matrix storage with a shift

As an example of taking the memory layout into account, consider memory with diagonal addressing. To ensure parallel processing of matrices, the elements of their rows and columns must be distributed among the memory modules of the processors in such a way that they can be read and processed simultaneously. In this case, the matrix is stored with a shift (Fig. 2.4).

Any algorithm contains sequential (scalar) sections. It has been proved that the length of these scalar sections is the determining factor in implementing the algorithm on a parallel computing system.

3. Degree and levels of parallelism

The degree of parallelism (D) is the order of the number of devices operating in parallel in the system when the task's algorithm is implemented, assuming that the number of processors (processing devices) is unlimited. (There is another definition: the degree of parallelism is the number of processors of a multiprocessor system participating in the execution of the program in parallel at each moment of time t.)

1) Low degree: from 2 to 10 processors.

2) Medium degree: from 10 to 100 processors.

3) High degree: from 100 to 10^4 processors.

4) Super-high degree: from 10^4 to 10^6 processors.

Fig. 2.5. Concurrency profile

A graphical representation of the parameter D(t) as a function of time is called the concurrency profile of the program. Changes in processor utilization during the observation period depend on many factors (the algorithm, the available resources, the degree of optimization provided by the compiler, etc.). Fig. 2.5 shows a typical concurrency profile.

There is a wide range of potential parallelism in application programs. In computationally intensive programs, from 500 to 3500 arithmetic operations could be performed in parallel in each cycle if a suitable computing environment existed. However, even a properly designed superscalar processor can support only 2 to 5.8 instructions per cycle. This drop is due primarily to communication and system overheads.

The degree of parallelism significantly affects the architecture of the computing system, especially the switching system, the organization of interaction between parallel processors, and the methods of data exchange between processors and memory. The level of parallelism has an even stronger influence on the performance of computing facilities than the degree of parallelism.

Let us consider the algorithmic and circuit (hardware) levels of parallelism.

The following algorithmic levels of parallelism are distinguished:

1. Job Level:

a) between tasks;

b) between task phases.

2. Program level:

a) between parts of the program (parts of one task are executed on a set of processing units);

b) within cycles.

(if the individual iterations of the loop do not depend on one another; for example: for I := 1 to N do A(I) := B(I) + C(I))

3. Command level:

a) between the phases of command execution.

4. Arithmetic and digit level:

a) between elements of a vector operation;

b) inside the logic circuits of the ALU.

Each of the levels is characterized by certain properties, on the basis of which special structures of computing facilities have been developed. The command level is implemented in any modern computer, including personal computers.

The circuit level of parallelism is the hardware level at which data processing is parallelized or parallel computation is organized.

Parallel processing can be implemented at the following circuit levels:

1. The level of logic gates and memory elements. This is the lowest, transistor level. Here parallel logic circuits (LS) are built from logic gates (for example, a parallel adder).

2. The level of logic circuits and simple automata with memory. From logic circuits, a parallel elementary automaton (EA) is built.

3. The level of registers and memory integrated circuits. From elementary automata, parallel microprocessor circuits (MP) are obtained.

4. The level of elementary microprocessors. From microprocessors, parallel macroprocessors (MAP) are built to perform medium-block operations.

5. The level of macroprocessors implementing large operations. Here the parallelism of macro-operations is realized. Macroprocessors are used to build parallel multiprocessor systems (MPS).

6. The level of computers, processors, and programs. The highest level of parallelism: parallel computing systems (VS) are built from multiprocessor systems.

4. Types of parallelism

4.1. Natural parallelism and parallelism of multiple objects

In the information graph, "vertical" independent subgraphs can be distinguished that do not use any intermediate results produced by the primitive operations of other subgraphs. This type of parallelism is called the natural parallelism of independent tasks.

A task has natural parallelism if, in its original formulation, it reduces to operations on multidimensional vectors, multidimensional matrices, or lattice functions (Fig. 2.6). Intermediate results of the tasks are not shared; each task is programmed independently of the others. This type of parallelism does not require combining computers into complexes, but increasing the number of independent tasks in a data-processing system raises the system's throughput. Example: processing of transactions against a database on multiprocessor servers.

Fig. 2.6. Information graph of a task characterized by natural parallelism

Fig. 2.7. Information graph of a task characterized by parallelism of a set of objects

Parallelism of a set of objects is a special case of natural parallelism. Its meaning is that the task processes information about different but same-type objects according to the same or almost the same program (Fig. 2.7).

A relatively small share here is taken by so-called integral operations. The initial operands of integral operations are vectors, functions, or sets of objects, and the result is a number. For example, computing the dot product of n-dimensional vectors, (a, b) = a_1 b_1 + a_2 b_2 + ... + a_n b_n,

involves two kinds of operations: the pairwise multiplication of the vector components and then an "integral operation" (an operation on an n-dimensional vector) that sums all the components of the resulting vector.

With parallelism of a set of objects, more often than in the general case, situations arise in which separate sections of the computation must be performed differently for different objects.

For example, when computing the values of some function defined on a bounded region, the values at all interior points are computed by one formula, and the values at the boundary points by another.

Parallelism of a set of objects is characterized by the following parameters:

1. Total program length L: the lengths of all operators in all branches are summed.

2. Average program length L_avg: calculated from the rank of the task.

The main quantitative characteristic of a parallelized task is the task rank r: the number of parameters over which parallel processing is carried out (for example, the number of vector components or the number of points at which the function is specified).

3. The divergence of the task D.

If the information-processing program is exactly the same for all r objects, then D = 1; the more the programs for different objects differ from one another, the larger D is.

4.2. Parallelism of Independent Branches

The essence of the parallelism of independent branches is that independent parts, called branches, can be identified in the program that solves the problem. If the computing system has the appropriate hardware, the branches can be executed in parallel (Fig. 2.8).

Fig. 2.8. Information graph of a task characterized by parallelism of independent branches

A program branch Y does not depend on a branch X if:

there are no functional links between them, i.e. none of the input variables of branch Y is an output variable of branch X or of any branch that depends on X;

  1. they have no connection through shared working areas of memory;
  2. they are executed by different programs;
  3. they are independent in control, i.e. the condition for executing branch Y does not depend on flags produced during the execution of branch X or of a branch that depends on it.

4.3. Parallelism of adjacent operations, or local parallelism

Parallelism of adjacent operations occurs when the input data for the current operations were obtained at earlier stages of the computation, and the design of the computing facilities allows the execution of several operations that are not related to each other through output data and results to be overlapped.

Local parallelism is characterized by the following parameters:

1. The connectivity indicator of adjacent operations: the probability that the result of some operation will be used by the operation immediately following it. The lower the connectivity of an operation, the greater the depth of parallelism of adjacent operations available for it. Typically this indicator takes values from 0.1 to 0.5.

2. The probability that, starting from a given operation, there is a chain of at least l operations that can be executed simultaneously.

3. The probability that, starting from any operation in the program, there is a chain of exactly l operations that can be executed simultaneously.

4. The depth of parallelism of adjacent operations L_PAO: the expected length of a chain of operations that can be executed simultaneously.

Local optimization of programs consists in examining several instructions that are to be executed in a row and changing the order of some of them, possibly changing register numbers and memory cells, so as to ensure the maximum possible parallelism of adjacent operations.

In most cases, the connectivity index of adjacent operations depends not so much on the problem as on the quality of local optimization.


"Parallelism as a way of parallel data processing"

Kotovsk, 2010

Introduction

The rapid development of science and the penetration of human thought into ever new areas, along with the solution of previously posed problems, constantly generate a stream of questions and pose new, as a rule more complex, tasks. In the era of the first computers it seemed that increasing their speed by a factor of 100 would solve most problems, yet the gigaflop performance of today's supercomputers is clearly insufficient for many scientists. Electro- and hydrodynamics, seismic exploration and weather forecasting, modeling of chemical compounds, research into virtual reality: this is far from a complete list of fields whose researchers use every opportunity to speed up the execution of their programs.

The most promising and dynamic way to increase the speed of solving applied problems is the wide introduction of the ideas of parallelism into the operation of computing systems. Hundreds of different computers that use one or another kind of parallel data processing in their architecture have been designed and tested to date. In the scientific literature and technical documentation one can find more than a dozen different names that characterize only the general principles of operation of parallel machines: vector-pipeline, massively parallel, computers with a very long instruction word, systolic arrays, hypercubes, special-purpose processors and multiprocessors, hierarchical and cluster computers, dataflow machines, matrix computers, and many others. If, for completeness of description, we add to these names data on such important parameters as memory organization, the topology of the links between processors, whether the individual devices operate synchronously, or the way arithmetic operations are performed, the number of distinct architectures becomes truly boundless.

Attempts to systematize the whole multitude of architectures began after M. Flynn published the first version of his classification of computing systems in the late 1960s, and they continue to this day. Classification is very important for a better understanding of the subject area under study, and finding a successful classification can have a number of significant consequences.

The main question of any classification is what to take as its basis, and it can be answered differently depending on whom the classification is intended for and what task it is aimed at. Thus, the common division of computers into personal computers, workstations, minicomputers, mainframes, minisupercomputers, and supercomputers allows one, perhaps, to roughly estimate the cost of a computer. However, it does not bring the user any closer to understanding what is required of him to write a program that runs at the performance limit of a parallel computer, i.e. the very reason he decided to use it.

A classification should help one understand what each architecture is, how architectures are interconnected, what must be taken into account to write really effective programs, and which class of architectures one should target to solve the required class of problems. At the same time, a successful classification could suggest possible ways of improving computers, and in this sense it should be quite meaningful. It is difficult to count on finding non-trivial "blank spots" in, say, a classification by cost, but thinking about a possible systematization in terms of simplicity and manufacturability of programming can be extremely useful in choosing the directions for searching for new architectures.

1. Parallel Computing Systems

Parallel computing systems are physical computer systems, as well as software systems, that implement parallel data processing in one way or another on many computing nodes.

The idea of parallel computation is based on the fact that most problems can be divided into a set of smaller problems that can be solved simultaneously. Parallel computing usually requires coordination. It comes in several forms: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism. For many years parallel computing was used mainly in high-performance computing, but recently interest in it has grown because of physical limits on the growth of processor clock frequency. Parallel computing has become the dominant paradigm in computer architecture, mostly in the form of multi-core processors.

It is more difficult to write programs for parallel systems than for sequential ones, since competition for resources introduces a new class of potential software errors (bugs), of which the race condition is the most common. Communication and synchronization between processes are a major barrier to obtaining high performance in parallel systems. In recent years the power consumption of parallel computers has also become an issue. The character of the speedup of a program as a result of parallelization is described by Amdahl's law.

If the calculation does not use cyclic (repetitive) actions, then N computing modules will never perform work N times faster than a single computing module.

For example, to quickly sort an array on a two-processor machine, you can split the array in half and sort each half on a separate processor. Sorting each half may take a different amount of time, so synchronization is necessary.
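A sketch of this example, assuming OpenMP is available (here with threads standing in for the two processors): each half of the array is sorted in its own parallel section, and the sequential merge after the parallel region plays the role of the synchronization step.

/* A sketch of the two-processor sorting example above, assuming OpenMP.
   Each half is sorted in its own section; the merge after the parallel
   region acts as the synchronization point. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static int cmp(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

int main(void) {
    int a[] = { 9, 3, 7, 1, 8, 2, 6, 4 };
    int n = 8, mid = n / 2;

    #pragma omp parallel sections
    {
        #pragma omp section
        qsort(a, mid, sizeof(int), cmp);           /* first half  */
        #pragma omp section
        qsort(a + mid, n - mid, sizeof(int), cmp); /* second half */
    }

    /* Sequential merge of the two sorted halves. */
    int b[8], i = 0, j = mid, k = 0;
    while (i < mid && j < n) b[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) b[k++] = a[i++];
    while (j < n)   b[k++] = a[j++];

    for (k = 0; k < n; k++) printf("%d ", b[k]);
    printf("\n");
    return 0;
}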

2. Types of Parallelism

2.1 Bit-Level Parallelism

This form of parallelism is based on increasing the size of the machine word. Increasing the machine word size reduces the number of operations the processor needs in order to operate on variables whose size exceeds the machine word. For example, on an 8-bit processor, adding two 16-bit integers requires first adding the lower 8 bits of the numbers, then adding the upper 8 bits together with the value of the carry flag: three instructions in total. A 16-bit processor performs the same operation with a single instruction.
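The following small model, written in ordinary portable C rather than 8-bit assembly, imitates this sequence: the low bytes are added first, the carry is extracted, and the high bytes are added together with that carry.

/* A small model of adding two 16-bit numbers using 8-bit quantities only:
   add the low bytes, then add the high bytes plus the carry. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t x = 0x12F0, y = 0x0A35;

    uint8_t xl = x & 0xFF, xh = x >> 8;
    uint8_t yl = y & 0xFF, yh = y >> 8;

    uint16_t low   = (uint16_t)xl + yl;           /* step 1: add low bytes          */
    uint8_t  carry = low > 0xFF;                  /* carry flag from the first add  */
    uint8_t  high  = (uint8_t)(xh + yh + carry);  /* steps 2-3: high bytes + carry  */

    uint16_t sum = ((uint16_t)high << 8) | (low & 0xFF);
    printf("0x%04X + 0x%04X = 0x%04X (expected 0x%04X)\n",
           x, y, sum, (uint16_t)(x + y));
    return 0;
}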

Historically, 4-bit microprocessors were replaced by 8-bit ones, then 16-bit and 32-bit ones appeared. 32-bit processors have long been the standard in everyday computing. With the advent of x86-64 technology, 64-bit processors began to be used for these purposes.

2.2 Parallelism at the instruction level

A computer program is essentially a stream of instructions executed by a processor. The order of these instructions can be changed, and they can be distributed into groups to be executed in parallel, without changing the result of the program as a whole. This technique is known as instruction-level parallelism. The main advances in instruction-level parallelism in computer architecture occurred from the mid-1980s to the mid-1990s.

Modern processors have a multi-stage instruction pipeline. Each stage of the pipeline corresponds to a certain action that the processor performs on an instruction at that stage. A processor with N pipeline stages can have up to N different instructions at different degrees of completion at the same time. The classic example of a pipelined processor is a RISC processor with five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and register write-back (WB). The Pentium 4 processor has a 35-stage pipeline.

Some processors, in addition to pipelining, can issue several instructions at the same time, which provides additional parallelism at the instruction level. This is implemented with superscalar execution, where instructions are grouped for parallel execution if they have no data dependences between them. Implementations with explicit instruction-level parallelism are also possible: VLIW and EPIC.

2.3 Data parallelism

The main idea of the approach based on data parallelism is that one operation is performed at once on all elements of a data array. Different fragments of such an array are processed on a vector processor or on different processors of a parallel machine. The program is responsible for distributing the data among the processors. Vectorization or parallelization in this case is most often performed at compile time, when the source code is translated into machine instructions. The programmer's role usually comes down to specifying vectorization or parallel-optimization options for the compiler, inserting parallel compilation directives, and using specialized languages for parallel computing.

2.4 Task parallelism (multithreading)

A programming style based on task parallelism implies that the computational task is divided into several relatively independent subtasks and each processor is loaded with its own subtask.
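A minimal sketch of this style with OpenMP tasks (the two subtasks here, a sum and a maximum over the same array, are arbitrary placeholders): one thread creates the tasks, and different threads of the team may execute them.

/* A minimal sketch of task parallelism with OpenMP: two independent
   subtasks executed as separate tasks (placeholder computations). */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a[] = { 4, 8, 15, 16, 23, 42 };
    int n = 6, sum = 0, max = a[0];

    #pragma omp parallel
    {
        #pragma omp single        /* one thread creates the tasks...             */
        {
            #pragma omp task shared(sum)
            for (int i = 0; i < n; i++) sum += a[i];

            #pragma omp task shared(max)
            for (int i = 1; i < n; i++) if (a[i] > max) max = a[i];

            #pragma omp taskwait  /* ...and waits until both tasks have finished */
        }
    }
    printf("sum = %d, max = %d\n", sum, max);
    return 0;
}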

2.5 Distributed operating systems

A distributed OS, by dynamically and automatically distributing work among the different machines in the system, makes a set of networked machines work as a virtual uniprocessor. The user of a distributed OS generally has no knowledge of the machine on which his work is performed.

A distributed OS exists as a single operating system across the entire computing system. Each computer on a network running a distributed OS performs part of the functions of this global OS. A distributed OS unites all the computers on the network in the sense that they work in close cooperation with one another to use all the resources of the computer network effectively.

Simple calculations show that configurations of such systems can cost more than a million US dollars: just for interest, estimate how much, say, 4 TB of RAM alone costs. A number of natural questions arise: what tasks are so important that they require multi-million-dollar computers? Or, what tasks are so complex that a good Pentium is not enough? We would like to find reasonable answers to these and similar questions.

In order to assess the complexity of the problems solved in practice, let us take a specific subject area: for example, optimization of the oil-extraction process. We have an underground oil reservoir with a number of drilled wells: through some of them oil is pumped to the surface, through others water is pumped back in. It is necessary to simulate the situation in this reservoir in order to estimate the oil reserves or to understand the need for additional wells.

Let us take a simplified scheme in which the simulated region is mapped onto a cube; it will be sufficient for estimating the number of arithmetic operations required. Reasonable cube dimensions for which plausible results can be obtained are 100×100×100 points. At each point of the cube, from 5 to 20 functions must be computed: three velocity components, pressure, temperature, and component concentrations (water, gas, and oil are the minimal set of components; in more realistic models various oil fractions are also considered). Further, the values of the functions are found by solving nonlinear equations, which requires from 200 to 1000 arithmetic operations. Finally, if a non-stationary process is studied, i.e. one needs to understand how the system behaves in time, then 100-1000 time steps are taken. The result:

10^6 (grid points) × 10 (functions) × 500 (operations) × 500 (time steps) = 2.5×10^12

2500 billion arithmetic operations to perform a single calculation! How about changing model parameters? What about tracking the current situation when the input data changes? Such calculations must be done many times, which imposes very stringent requirements on the performance of the computing systems used.

Examples of the use of supercomputers can be found not only in the oil industry. Here is just a small list of areas of human activity where the use of supercomputers is really necessary:

  • automotive
  • oil and gas production
  • pharmacology
  • weather forecasting and climate change modeling
  • seismic exploration
  • design of electronic devices
  • synthesis of new materials
  • and many, many more

In 1995, the body of the Nissan Maxima was made 10% stronger with the help of a Cray supercomputer (The Atlanta Journal, 28 May 1995). It was used not only to find the weak points of the body but also to determine the most effective way to remove them.

According to Mark Miller (Ford Motor Company), to perform crash tests, in which real cars are smashed against a concrete wall while the necessary parameters are measured, the crash is filmed and the results are post-processed, Ford would need from 10 to 150 prototypes of each new model, with total costs from $4 million to $60 million. The use of supercomputers reduced the number of prototypes by one third.

A very recent example is the development of one of the world's largest reservation systems, Amadeus, used by thousands of agencies with 180,000 terminals in more than a hundred countries. The installation of two Hewlett-Packard T600 servers with 12 processors each made it possible to increase the degree of operational availability of the central system to 99.85% with the current load of about 60 million requests per day.

And similar examples can be found everywhere. At one time, DuPont researchers were looking for a replacement for chlorofluorocarbons. It was necessary to find a material with the same positive qualities: non-flammability, corrosion resistance, and low toxicity, but without the harmful effect on the earth's ozone layer. In one week the necessary calculations were carried out on a supercomputer at a total cost of about 5 thousand dollars. According to DuPont experts, the use of traditional experimental research methods would have required about three months and 50 thousand dollars, and that is without counting the time needed to synthesize and purify the required amount of the substance.

Increasing computer performance: what is it due to?

Why do supercomputers compute so fast? There may be several answers, among which two have a clear advantage: the development of the element base and the use of new solutions in computer architecture.

Let us try to figure out which of these factors is decisive for achieving record performance, turning to well-known historical facts. On one of the first computers in the world, EDSAC, which appeared in 1949 in Cambridge and had a cycle time of 2 microseconds (2×10^-6 s), it was possible to perform 2n arithmetic operations in 18n milliseconds, i.e. on average about 100 arithmetic operations per second. Compare this with one computing node of a modern Hewlett-Packard V2600 supercomputer: a cycle time of approximately 1.8 nanoseconds (1.8×10^-9 s) and a peak performance of about 77 billion arithmetic operations per second.

What does this mean? In half a century, computer performance has grown by a factor of more than seven hundred million. At the same time, the performance gain associated with reducing the cycle time from 2 microseconds to 1.8 nanoseconds is only about a factor of 1000. Where did the rest come from? The answer is obvious: from the use of new solutions in computer architecture. The main place among them is occupied by the principle of parallel data processing, which embodies the idea of the simultaneous (parallel) execution of several actions.

Parallel data processing on a computer

Parallel data processing, embodying the idea of the simultaneous execution of several actions, has two varieties: pipelining and parallelism proper. Both kinds of parallel processing are intuitive, so we will give only brief explanations.

Parallel processing. If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five such independent devices that can work simultaneously, then a system of five devices can perform the same thousand operations in not a thousand, but in two hundred units of time. Similarly, a system of N devices will do the same job in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will do the same job in 12 minutes - the principle of parallelism in action!

Incidentally, the pioneer of parallel processing of data streams was Academician A. A. Samarsky, who solved this problem by seating several dozen young ladies with mechanical adding machines at tables. The young ladies passed data to one another simply in words and entered the necessary numbers on their adding machines. In this way, in particular, the evolution of a blast wave was calculated. There was a lot of work, the ladies got tired, and Alexander Andreevich walked among them and encouraged them. This, one might say, was the first parallel system. Although the hydrogen-bomb calculations were carried out masterfully, their accuracy was very low, because there were few nodes in the grid used and the computation time was too long.

Pipelining. What is needed to add two real numbers represented in floating-point form? A whole host of small operations, such as comparing exponents, aligning exponents, adding mantissas, normalization, etc. The processors of the first computers performed all these "micro-operations" for each pair of arguments sequentially, one after another, until they reached the final result, and only then proceeded to process the next pair of operands.

The idea of pipeline processing is to single out separate stages of performing a general operation, so that each stage, having completed its work, passes the result to the next stage while simultaneously accepting a new portion of input data. We obtain an obvious gain in processing speed by overlapping operations that were previously spread out in time. Suppose an operation can be divided into five micro-operations, each performed in one unit of time. If there is a single indivisible sequential device, it will process 100 pairs of arguments in 500 units. If each micro-operation is made a separate stage of a pipelined device, then by the fifth unit of time the first five pairs of arguments will be at different stages of processing in such a device, and the whole set of one hundred pairs will be processed in 5 + 99 = 104 units of time: a speedup, compared with the sequential device, of almost a factor of five (the number of pipeline stages).
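In general form (a standard estimate, not taken from this text), for an operation split into s pipeline stages of one time unit each and n pairs of arguments:

T_{\mathrm{pipe}} = s + (n - 1), \qquad T_{\mathrm{seq}} = s\,n, \qquad \frac{T_{\mathrm{seq}}}{T_{\mathrm{pipe}}} = \frac{s\,n}{s + n - 1} \to s \quad (n \to \infty),

which for s = 5 and n = 100 gives 104 units against 500, in agreement with the figures above.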

It would seem that pipeline processing could be successfully replaced by ordinary parallelism: just duplicate the main device as many times as the number of pipeline stages to be allocated. Indeed, the five devices of the previous example will process 100 pairs of arguments in 100 units of time, which is faster than the pipelined device! What is the catch? The answer is simple: by increasing the number of devices fivefold, we considerably increase both the amount of equipment and its cost. Imagine that a car factory decided to remove the assembly line while maintaining the pace of car production. If earlier there were a thousand cars on the assembly line at the same time, then, acting by analogy with the previous example, it would be necessary to recruit a thousand teams, each of which (1) is able to completely assemble a car from start to finish, performing hundreds of different operations, and (2) can do it in the same time that the car previously spent on the assembly line. Can you imagine the cost of such a car? Hard to, unless a Lamborghini comes to mind; and that is why the assembly line was born...

A Brief History of the Emergence of Parallelism in Computer Architecture

Today, parallelism in computer architecture surprises few people. All modern microprocessors, be it Pentium III or PA-8700, MIPS R14000, E2K or Power3, use some form of parallel processing. In the Pentium 4 core, up to 126 micro-operations can be at different stages of execution simultaneously. At presentations of new chips and in corporate press releases this is presented as the latest word in technology and the cutting edge of science, and that is true, if we consider the implementation of these principles within the miniature confines of a single chip.

However, these ideas themselves appeared a very long time ago. Initially they were implemented in the most advanced, and therefore one-of-a-kind, computers of their time. Then, after the technologies matured and production became cheaper, they made their way into middle-class computers, and today all of this is fully embodied in workstations and personal computers.

To see that all the major innovations in the architecture of modern processors have in fact been in use since the days when neither microprocessors nor the very concept of a supercomputer existed, let us take a short excursion into history, starting almost from the moment the first computers were born.

IBM 701 (1953), IBM 704 (1955): bit-parallel memory, bit-parallel arithmetic.
All the very first computers (EDSAC, EDVAC, UNIVAC) had bit-serial memory, from which words were read bit by bit. The first commercially available computer to use bit-parallel memory (on CRTs) and bit-parallel arithmetic was the IBM 701, and the most popular model was the IBM 704 (150 copies sold), which in addition was the first to use ferrite-core memory and a hardware floating-point arithmetic unit.

IBM 709 (1958): independent I/O processors.
The processors of the first computers handled I/O themselves. However, the fastest external device of that time, the magnetic tape, was 1000 times slower than the processor, so the processor was effectively idle during I/O operations. In 1958, six independent I/O processors were attached to the IBM 704, which after receiving commands could work in parallel with the main processor, and the computer itself was renamed the IBM 709. This model turned out to be surprisingly successful: about 400 copies, including modifications, were sold, and the last one was switched off in 1975 - 20 years of service!

IBM STRETCH (1961): look-ahead, memory interleaving.
In 1956, IBM signed a contract with the Los Alamos Scientific Laboratory to develop the STRETCH computer, which had two fundamentally important features: look-ahead instruction fetching and interleaving of memory into two banks, to reconcile the low memory-access rate with the speed of arithmetic operations.

ATLAS (1963): instruction pipeline.
The pipelined principle of instruction execution was first used in the ATLAS machine developed at the University of Manchester. Instruction execution was divided into 4 stages: instruction fetch, operand address calculation, operand fetch, and operation execution. Pipelining made it possible to reduce the instruction execution time from 6 µs to 1.6 µs. This computer had a huge impact on both computer architecture and software: it was the first to use a multiprogramming OS based on virtual memory and an interrupt system.

CDC6600 (1964): independent functional units.
Control Data Corporation (CDC), with the direct participation of one of its founders, Seymour R. Cray, releases the CDC-6600, the first computer to use several independent functional units. For comparison with today, here are some of the computer's parameters:

  • cycle time 100ns,
  • performance 2-3 million operations per second,
  • RAM is divided into 32 banks of 4096 60-bit words,
  • 1µs memory cycle,
  • 10 independent functional units.
The machine was a huge success in the scientific market, actively replacing IBM machines.

CDC7600 (1969): pipelined independent functional units.
CDC releases the CDC-7600 computer with eight independent pipelined functional units: a combination of parallel processing and pipeline processing. Main parameters:

  • cycle time 27.5 ns,
  • 10-15 million operations/sec,
  • 8 pipelined functional units,
  • 2-level memory.

ILLIAC IV (1974): matrix processors.

Design: 256 processor elements (PEs) = 4 quadrants of 64 PEs each, reconfigurable as 2 quadrants of 128 PEs or 1 quadrant of 256 PEs; cycle time 40 ns; design performance 1 Gflop.

Work began in 1967; by the end of 1971 a one-quadrant system had been built; in 1974 it was put into operation, and fine-tuning continued until 1975.

Central part: a control unit (CU) plus a matrix of 64 PEs.

  • The CU is a simple, low-performance computer that controls the PE matrix; all the PEs of the matrix worked synchronously, executing at each moment the same instruction received from the CU, but on their own data.
  • Each PE had its own ALU with a full instruction set and a local memory of 2K 64-bit words with a 350 ns memory cycle; each PE had direct access only to its own memory.
  • Data transfer network: a two-dimensional torus with a shift of 1 along the boundary in the horizontal direction.

Despite falling short of the design goals (the cost was 4 times higher, only one quadrant was built, the cycle time was 80 ns instead of 40 ns, and the real performance was at most 50 Mflops), this project had a huge impact on the architecture of subsequent machines built on a similar principle, in particular PEPE, BSP, and ICL DAP.

Memory hierarchy.
The memory hierarchy is not directly related to parallelism, but it certainly belongs to those features of computer architecture that are of great importance for increasing performance (smoothing the difference between the processor speed and the memory access time). The main levels are registers, cache memory, main memory (RAM), and disk memory. The access time decreases from disk memory toward the registers, while the cost per word (byte) increases. At present such a hierarchy is supported even in personal computers.

And what is now used in the world?

In what directions is the development of high-performance computing technology going at the present time? There are four main directions.

Suppose that the fraction of operations in your program that must be performed sequentially is f, where 0 ≤ f ≤ 1. Then, according to Amdahl's law, the speedup obtainable on a computing system with p processors, compared with the single-processor variant, cannot exceed 1/(f + (1 - f)/p).

If 9/10 of the program is executed in parallel and 1/10 is still sequential, then a speedup of more than 10 times cannot be obtained in principle, regardless of the quality of the implementation of the parallel part of the code and the number of processors used (it is clear that 10 is reached only when the execution time of the parallel part is 0).
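In formula form (the standard statement of Amdahl's law, with f the sequential fraction and p the number of processors):

S_p \le \frac{1}{f + \frac{1 - f}{p}}, \qquad \lim_{p \to \infty} S_p = \frac{1}{f}; \qquad \text{for } f = \frac{1}{10}: \quad S_p \le \frac{1}{0.1 + 0.9/p} < 10.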

Let us look at the problem from the other side: what part of the code must be sped up (and therefore studied beforehand) in order to obtain a given speedup? The answer is given by a corollary of Amdahl's law: in order to speed up the execution of a program by a factor of q, at least a (1 - 1/q) fraction of the program must be sped up by at least a factor of q. Therefore, if you want to speed up a program by a factor of 100 compared with its sequential version, you must obtain no smaller speedup on at least 99% of the code, which almost always constitutes a significant part of the program!

Hence the first conclusion: before thoroughly reworking code for a parallel computer (and any supercomputer is one), you should think carefully. If, having evaluated the algorithm embedded in the program, you realize that the fraction of sequential operations is large, then you clearly cannot count on a significant speedup, and you need to think about replacing individual components of the algorithm.

In some cases the sequential nature of an algorithm is not that difficult to change. Suppose a program contains the following fragment for computing the sum of n numbers:

s = 0
Do i = 1, n
  s = s + a(i)
EndDo

(the same can be written in any other language).

By its nature it is strictly sequential: the i-th iteration of the loop requires the result of the (i-1)-th, and all iterations are performed one after another. We have 100% sequential operations and hence no benefit whatsoever from using parallel computers. At the same time, the way out is obvious. Since in most real programs (question: why in most and not in all?) there is no significant difference in what order the numbers are added, let us choose a different summation scheme. First we find the sums of pairs of adjacent elements: a(1)+a(2), a(3)+a(4), a(5)+a(6), etc. Note that with this scheme all the pairs can be added simultaneously! At the following steps we act in exactly the same way, obtaining a parallel version of the algorithm.
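A sketch of this regrouped summation in C with OpenMP (assuming OpenMP is available): the reduction clause makes each thread accumulate a private partial sum over its share of the elements, and the partial sums are then combined, following the same idea of reordering the additions.

/* A sketch of the regrouped summation: each thread accumulates a partial
   sum, and the partial sums are combined at the end (reduction clause). */
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0 / N;

    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += a[i];

    printf("s = %f\n", s);
    return 0;
}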

It would seem that in this case all the problems are solved. But imagine that the processors available to you differ in performance. Then there will be a moment when one of them is still working while the others have already finished and are waiting uselessly. If the spread in computer performance is large, then the efficiency of the whole system under a uniform distribution of the load across processors will be extremely low.

But let us go further and assume that all the processors are identical. Are the problems over? No, again! The processors have finished their work, but the result must be passed to another processor to continue the summation... and the transfer takes time... and during this time the processors are idle again...

In short, making a parallel computing system or a supercomputer operate at maximum efficiency on a specific program is, frankly, not an easy task, since the structure of the programs and algorithms must be carefully matched to the architectural features of parallel computing systems.

A final question. Do you think the following statement is true: the more powerful the computer, the faster it solves a given problem?

The final answer: no, that is not true. This can be illustrated with a simple everyday example. If one digger digs a pit 1 m × 1 m × 1 m in 1 hour, then two such diggers will dig it in 30 minutes - believable. But how long will it take 60 diggers? One minute? Of course not! Starting from a certain point they will simply get in each other's way, not speeding up but slowing down the process. It is the same with computers: if the task is too small, we will spend more time distributing the work, synchronizing the processes, and assembling the results than on directly useful work.

It is clear that not everything is so simple ...

Laboratory of Parallel Information Technologies, Research and Development Center of Moscow State University


MINISTRY OF EDUCATION AND SCIENCE OF THE REPUBLIC OF KAZAKHSTAN

North Kazakhstan State University named after M. Kozybayev

Faculty of Information Technology

Department of Information Systems

Data Parallel Process

Completed by: Makhkambaeva A.S.

Checked by: Kasimov I.R.

Petropavlovsk, 2014

Introduction

In uniprocessor systems there is so-called pseudo-parallelism: although at each moment the processor is busy processing one specific task, switching from one task to another creates the illusion of the parallel execution of several tasks. In multiprocessor systems, the goal of making maximum use of each processor is also pursued by switching between processes, but here, along with pseudo-parallelism, there is real parallelism, when different processes are executed on different processors at the same time.

The idea of ​​data processing parallelization is based on the fact that most tasks can be divided into a set of smaller tasks that can be solved simultaneously. Processes, the execution of which at least partially overlaps in time, are called parallel.

In 1967, Gene Amdahl formulated the law of limiting the growth of performance in parallel computing: “In the case when a task is divided into several parts, the total time of its execution on a parallel system cannot be less than the time of execution of the longest fragment.” According to this law, speeding up the execution of a program by parallelizing its instructions is limited by the time required to execute its sequential instructions.

Flynn classification


The classification is based on two concepts: command streams and data streams. A system with N processors has N program counters and therefore N instruction streams.

Instruction streams | Data streams | Name
one                 | one          | SISD
one                 | many         | SIMD
many                | one          | MISD
many                | many         | MIMD

SISD (Single Instruction, Single Data) is a computer architecture in which one processor executes one instruction stream operating on one data stream. For this class, only pseudo-parallelism is possible.

SIMD (Single Instruction, Multiple Data) is a computer architecture that allows parallelism at the data level. The main idea of ​​the approach based on data parallelism is that one operation is performed immediately on all elements of the data array. These systems typically have a large number of processors, from 1024 to 16384, that can execute the same instruction created by a single control unit on different data. At any moment, the same instruction is being executed in each processor, but different data is being processed. A synchronous parallel computing process is implemented.

MISD (Multiple Instruction, Single Data) is a computer architecture in which several functional modules (two or more) perform different operations on the same data. Fault-tolerant computers that execute the same commands redundantly in order to detect errors belong, as follows from the definition, to this type.

MIMD (Multiple Instruction, Multiple Data) is a computer architecture where several independent processors work as part of a large system. Processing is divided into several threads (parallelism is provided), each with its own processor hardware state, within a single defined software process or within multiple processes.

There are two subclasses of MIMD systems: shared-memory systems and distributed-memory systems. Systems of the first type are characterized by the fact that any processor has direct access to any cell of the common RAM. Distributed-memory systems are usually a union of computing nodes; a node is an independent processor with its own local RAM. In such systems a processor cannot arbitrarily access the memory of another processor.

OpenMP (Open Multi-Processing) is an open standard for parallelizing C, C++, and Fortran programs. It describes a set of directives intended for programming multithreaded applications on multiprocessor systems with shared memory. OpenMP implements parallel computation through multithreading, in which a "master" thread creates a set of worker threads and the task is distributed among them.

Tasks performed by threads in parallel, as well as the data required to perform these tasks, are described using special directives of the preprocessor of the corresponding language - pragmas. The C program must include the file "omp.h".

The next loop adds the arrays "a" and "b" element by element. All that is required for parallel execution in this case is one pragma inserted just before the loop.

#pragma omp parallel for
for (i=0; i< numPixels; i++)
    c[i] = a[i]+b[i];

This example uses "load balancing", a generic term used in OpenMP to describe workload distribution among threads. If load balancing is applied with the for directive, as shown in the example, the iterations of the loop are distributed among multiple threads so that each iteration of the loop is executed only once, in parallel by one or more threads. OpenMP determines how many threads should be created and also the best way creating, synchronizing and destroying threads. All the programmer has to do is tell OpenMP which loop to parallelize.

Load balance (distributing the workload equally among threads) is one of the most important attributes of an application's parallel execution. Without it, some threads may complete their work much earlier than others, which leads to idle computing resources and loss of performance.

By default, OpenMP assumes that all loop iterations take the same amount of time. As a result, OpenMP distributes the iterations among the threads approximately equally, and in such a way as to minimize the probability of memory conflicts caused by false sharing.
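When the iterations are known to take very different amounts of time, the default distribution can be overridden with OpenMP's schedule clause. A sketch (work() is a hypothetical function whose cost grows with the iteration number):

/* A sketch of explicit load balancing: schedule(dynamic, 4) hands threads
   chunks of 4 iterations on demand, so threads that receive cheap
   iterations simply take more chunks.  work() is a placeholder. */
#include <stdio.h>
#include <omp.h>

static double work(int i) {            /* cost grows with i */
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++) s += 1.0 / (k + 1);
    return s;
}

int main(void) {
    enum { N = 256 };
    double result[N];

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        result[i] = work(i);

    printf("result[N-1] = %f\n", result[N - 1]);
    return 0;
}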

Consider, however, the following loop:

#pragma omp parallel for
for (i=2; i< 10; i++)
    factorial[i] = i * factorial[i-1];

Even if a loop meets all the formal restrictions and the compiler has parallelized it, it is still not guaranteed to work correctly, because it may contain a data dependency.

A data dependency exists when different iterations of the loop (more precisely, iterations running on different threads) read or write the same shared memory.

MPI (Message Passing Interface) is a message-passing programming interface that allows processes performing a common task to exchange messages. MPI is aimed first of all at systems with distributed memory. Implementations exist for Fortran, C, and C++.

In the first version of MPI, the number of processes (branches) is fixed at the moment the program starts, i.e. branches cannot be spawned dynamically. This feature appeared in version 2.0.

When an application starts, all of its branches form a group of branches (an ordered set of branches). Each group is associated with a "communication field" that describes all the participants in the data exchange and the data common to all of them. Communicators are used to describe a communication field. All data-exchange operations can occur only within a single communication field (this is ensured by checking the communicators).

For C, the general format is

rc = MPI_Xxxxx(parameter, ...);

Note that case matters here: MPI must be capitalized, as must the first letter after the underscore, and all subsequent characters must be lower case. The variable rc is an integer return code; on success it is set to MPI_SUCCESS. A C program must include the header file "mpi.h".

MPI messages consist of two main parts: the data being sent or received, and accompanying information (the envelope) that helps the data travel along a particular route.

The data part is described by the buffer start, the count and the data type. A buffer is simply the memory that the compiler has allocated for a variable (often an array) in your program; the buffer start is the address where the data begins, for example the beginning of an array. The count is the number of elements (not bytes!) in the message; the data type determines the size of a single element.

The envelope information includes the rank within the communicator (the process identifier in the communication field), the tag (an arbitrary number that helps distinguish messages from one another) and the communicator itself, whose check ensures that the transfer takes place within a single communication field.
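
A minimal two-process sketch showing all of these parts - the buffer start, the element count, the data type, the destination/source rank, the tag and the communicator (error-code checking is omitted for brevity):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank in the communication field */

    if (rank == 0) {
        /* buffer start, count, type, destination rank, tag, communicator */
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        printf("process 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}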

Parallel data processing

There are several ways to divide responsibilities between threads:

* delegation ("manager-worker");

* peer-to-peer network;

* pipeline;

* "manufacturer-consumer".

Each model has its own work breakdown structure, which defines who is responsible for creating threads and under what conditions they are created.

In the delegation model, one thread (the "manager") creates threads (the "workers") and assigns each of them a task. The manager thread waits until all worker threads have completed their tasks. The manager delegates the task that each worker thread must perform by specifying some function; along with the task, the worker thread is given responsibility for carrying it out and obtaining the results. At the stage of obtaining results, the worker's actions may also be synchronized with the manager (or another) thread.

Whereas the delegation model has a manager thread that delegates tasks to worker threads, in the peer model all threads have equal status. Although a single thread initially creates all the threads needed to perform the tasks, it is itself considered a worker thread and performs no delegation. In this model there is no centralized thread, and the worker threads carry more responsibility: all peer threads may process requests from the same input stream, or each worker thread may have its own input stream for which it is responsible. Worker threads may also need to communicate and share resources.

The pipeline model is similar to an assembly line in that it assumes a stream of items processed in stages. At each stage a separate thread performs some operation on a given portion of the input data. When that portion has passed through all the stages, the processing of this part of the input stream is complete. The approach allows several portions of input data to be processed at the same time. Each thread is responsible for producing its intermediate results and making them available to the next stage (the next thread) of the pipeline; the last stage (thread) produces the result of the pipeline as a whole.

In the producer-consumer model there is a producer thread that prepares data to be consumed by a consumer thread. The data is stored in a block of memory shared between the producer and the consumer. The producer must prepare the data before the consumer can receive it, and this requires synchronization: if the producer delivers data much faster than the consumer can take it, the producer will overwrite previously produced results before the consumer has processed them; if the consumer takes data much faster than the producer can supply it, the consumer will either process the same data again or try to take data that has not yet been prepared.
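
A minimal POSIX threads sketch of the producer-consumer model with a one-slot shared buffer; the mutex and condition variable prevent both of the problems described above (the fixed count of 10 items is an illustrative assumption):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int buffer;           /* one-slot shared buffer */
static int full = 0;         /* 1 while the buffer holds unconsumed data */

static void *producer(void *arg)
{
    for (int item = 1; item <= 10; item++) {
        pthread_mutex_lock(&lock);
        while (full)                       /* wait until the consumer has taken the data */
            pthread_cond_wait(&ready, &lock);
        buffer = item;                     /* "produce" the next item */
        full = 1;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 10; i++) {
        pthread_mutex_lock(&lock);
        while (!full)                      /* wait until the producer has supplied data */
            pthread_cond_wait(&ready, &lock);
        int item = buffer;
        full = 0;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
        printf("consumed %d\n", item);     /* "consume" outside the critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}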

Synchronous and asynchronous processes

Synchronous processes are processes with interleaved execution, in which one process suspends its execution until another one completes. For example, process A, the parent, creates process B, the child, and suspends its execution until process B terminates. When process B terminates, its exit code is placed in the process table; this notifies process A that process B has terminated, and process A can continue executing and then terminate, or terminate immediately.

Asynchronous processes run independently of one another: process A runs to completion regardless of process B. There may or may not be a direct parent-child relationship between asynchronous processes. If process A creates process B, both can run independently, but at some point the parent must obtain the exit status of the child. If the processes are not directly related, they may have a common parent.

Asynchronous processes can share resources such as files or memory. This may or may not require synchronization or interaction when sharing resources.

Process synchronization is the coordination of several processes so that certain stages of different processes are performed in a certain order or simultaneously.

Synchronization is necessary whenever parallel processes need to interact. It is organized with the means of interprocess communication, the most commonly used of which are signals and messages, semaphores and mutexes, pipes, and shared memory.

Inter-Process Communication

One solution to the problem of synchronizing access to critical resources is to disable all interrupts immediately after the process enters the critical section and to re-enable them just before it exits. If interrupts are disabled, no process switching occurs, since control can be transferred to the scheduler only by means of an interrupt.

This approach, however, has a number of significant drawbacks. There is no guarantee that a process that has disabled interrupts will not loop inside its critical section, leaving the system completely inoperable. In addition, the method is unsuitable for a multiprocessor system: disabling interrupts on one processor does not affect the execution of processes on the other processors of the VS, and those processors still have access to the shared resource.

A message is a method of interaction in which one process sends a message to another, and the latter receives it. If the message has not arrived, the second process is blocked (waits for the message) or immediately returns an error code.

Many problems are associated with message passing. For example, a message may be lost. To avoid loss, the recipient sends back an acknowledgment; if the sender does not receive the acknowledgment within a certain time, it sends the message again.

Now imagine that the message is received, but the confirmation did not reach the sender. The sender will send it again and it will reach the recipient twice. It is essential that the recipient be able to distinguish between a copy of a previous message and a new one. This is easily solved by embedding the message number in its body.

A semaphore is an object that allows no more than n processes to enter a given section of code (usually a critical section).

Three operations are possible with a semaphore:

1) init(n); - initialize the counter (the number passed is the number of processes that may access the critical section simultaneously);

2) wait(); - wait until the counter becomes greater than 0, then decrement it by one;

3) leave(); - increment the counter by one.

Before a process accesses the critical section, it must call the wait() method, after which it is guaranteed that, counting itself, no more than n processes are inside the critical section at the same time. The process can then continue and, after finishing its work with the critical section, call the leave() method, thereby letting other processes know that "a place has been freed".
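
The init/wait/leave operations above are abstract; in POSIX they correspond to sem_init, sem_wait and sem_post. A minimal sketch that admits at most N threads into the critical section at once (N = 4 is an arbitrary assumption):

#include <semaphore.h>

#define N 4                        /* at most N threads inside at once */

static sem_t slots;

void init_slots(void)
{
    sem_init(&slots, 0, N);        /* init(n): the counter starts at N */
}

void enter_and_work(void)
{
    sem_wait(&slots);              /* wait(): block while the counter is 0, then decrement */
    /* ... critical section: at most N threads are here simultaneously ... */
    sem_post(&slots);              /* leave(): increment the counter, freeing a slot */
}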

If the number of calls to the wait() and leave() methods does not match, the system will not work correctly, just as in the case of a process deadlock - a situation in which several processes wait endlessly for resources held by one another:

Process 1                                    Process 2
Wants to capture A and B, starts with A      Wants to capture A and B, starts with B
Seizes resource A                            Seizes resource B
Waits for resource B to become free          Waits for resource A to become free
                       Deadlock

Debugging deadlocks, like other synchronization errors, is complicated by the fact that specific conditions of simultaneous execution of several processes are needed for them to occur (in the example above, if process 1 had managed to seize resource B before process 2, the error would not have appeared).

Mutexes are the simplest binary semaphores, which can be in one of two states - signaled or non-signaled (open or closed, respectively). When any thread of any process becomes the owner of a mutex object, the mutex is put into the non-signaled (closed) state; when the task releases the mutex, its state becomes signaled (open).

The purpose of a mutex is to protect an object from access by any thread other than the one that owns the mutex. At any given moment only one thread can own an object protected by a mutex; if another thread needs access to a variable protected by a mutex, that thread is put to sleep until the mutex is released.
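
A minimal pthreads sketch of this behaviour: the counter is the object protected by the mutex, and only the thread that currently owns the mutex may modify it:

#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;             /* shared object protected by the mutex */

void increment(void)
{
    pthread_mutex_lock(&counter_lock);   /* sleeps if another thread owns the mutex */
    counter++;                           /* only one thread can be here at a time */
    pthread_mutex_unlock(&counter_lock); /* release; a waiting thread may now enter */
}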

Test-and-set is a simple non-interruptible (atomic) processor instruction that copies the value of a variable into a register and writes a new value into it. While this instruction is executing, the processor cannot interrupt it and switch to another thread; on a multiprocessor architecture, while one processor executes the instruction on a memory location, the other processors cannot access that location.
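
In C11 the test-and-set operation is available through atomic_flag; a simple busy-waiting spinlock built on it might look like this (a sketch with no fairness guarantees):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Atomically set the flag and return its previous value;
       keep spinning while another thread already holds the lock. */
    while (atomic_flag_test_and_set(&lock))
        ;   /* busy wait */
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock);   /* release the lock */
}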

Dekker's algorithm is the first known correct solution of the mutual exclusion problem in concurrent programming. It allows two threads of execution to share access to a single resource without conflict, using only shared memory for communication.

If two processes try to enter the critical section at the same time, the algorithm allows only one of them to do so, based on whose turn it is at that moment. If one process has already entered the critical section, the other waits until the first one leaves it. This is implemented with two flags (indicators of the "intention" to enter the critical section) and a turn variable (indicating whose turn it is).

One advantage of the algorithm is that it does not require special test-and-set instructions and is therefore easily ported to different programming languages and computer architectures. Its disadvantages are that it applies only to the case of two processes and that it uses busy waiting instead of suspending the process (busy waiting implies that the processes should spend a minimal amount of time inside the critical section).
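
A sketch of Dekker's algorithm for two threads numbered 0 and 1; C11 atomics with the default sequentially consistent ordering are used here so that the loads and stores are not reordered (the original algorithm predates such concerns):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool wants_to_enter[2];   /* "intention" flags, both false initially */
static atomic_int  turn;                /* whose turn it is, 0 or 1 */

void enter_region(int self)             /* self is 0 or 1 */
{
    int other = 1 - self;
    atomic_store(&wants_to_enter[self], true);
    while (atomic_load(&wants_to_enter[other])) {
        if (atomic_load(&turn) != self) {
            /* Not our turn: withdraw the intention so the other thread can enter. */
            atomic_store(&wants_to_enter[self], false);
            while (atomic_load(&turn) != self)
                ;                       /* busy wait */
            atomic_store(&wants_to_enter[self], true);
        }
    }
}

void leave_region(int self)
{
    atomic_store(&turn, 1 - self);      /* give the turn to the other thread */
    atomic_store(&wants_to_enter[self], false);
}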

The Peterson algorithm is a software algorithm for mutual exclusion between threads of execution. Although originally formulated for the two-thread case, it can be generalized to an arbitrary number of threads. The algorithm is called software-based because it does not rely on special processor instructions for disabling interrupts, locking the memory bus, and so on; only shared memory variables and a waiting loop are used to enter the critical section of the executable code.

Before executing its critical section, a thread must call a special procedure (let us call it EnterRegion) with its own number as a parameter; this procedure makes the thread wait for its turn to enter the critical section. After executing the critical section and leaving it, the thread calls another procedure (let us call it LeaveRegion), after which other threads are able to enter the critical region.

The general principle of the Peterson algorithm for 2 threads:

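A minimal C sketch of EnterRegion and LeaveRegion for two threads numbered 0 and 1 (again using C11 atomics so that the shared variables are not cached or reordered):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool interested[2];   /* flags of "intention" to enter, both false initially */
static atomic_int  last;            /* which thread entered the waiting loop last */

void EnterRegion(int self)          /* self is 0 or 1 */
{
    int other = 1 - self;
    atomic_store(&interested[self], true);
    atomic_store(&last, self);      /* politely let the other thread go first */
    while (atomic_load(&interested[other]) && atomic_load(&last) == self)
        ;                           /* busy wait until it is our turn */
}

void LeaveRegion(int self)
{
    atomic_store(&interested[self], false);   /* the other thread may now enter */
}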

Process scheduling

Scheduling - ensuring sequential access of processes to one processor.

The scheduler is the part of the operating system responsible for this.

A non-preemptive scheduling algorithm does not require a hardware timer interrupt; a process stops only when it blocks or exits.

A preemptive (priority) scheduling algorithm requires a hardware timer interrupt; a process runs only for the allotted time slice, after which the timer suspends it so that control can be transferred to the scheduler.

Processes are placed in priority queues according to the scheduling strategy. UNIX/Linux systems use two scheduling strategies: FIFO (First In, First Out) and RR (round-robin, i.e. cyclic).

When using the FIFO strategy, processes are assigned to the processor according to the time they entered the queue.

RR scheduling is the same as FIFO scheduling with one exception: after the time slice expires, the process is placed at the end of its priority queue, and the next (in order) process is assigned to the processor.

Priority scheduling may be appropriate for ensuring the parallel execution of processes. Each process is assigned a priority, and control is transferred to the process with the highest priority. Priority can be dynamic or static. A dynamic priority can be set, for example, as P = 1/T, where T is the fraction of the last quantum that was used (if 1/50 of the quantum is used, the priority is 50; if the whole quantum is used, the priority is 1).

Often processes are grouped by priority: priority scheduling is used between the groups, while round-robin scheduling is used within a group.



By the term parallel processing we will understand the simultaneous execution of tasks, task steps, programs, subroutines, loops, operators and instructions. Parallel processing of information can be used for two main purposes:

1. Increasing the performance of computers and computing systems (VS) not by improving the element base but by efficiently organizing the computing processes.

2. Ensuring high reliability of computing systems through duplication of computing equipment.

Fig. 5.1. Levels of parallelism

Improving the performance of computers and VS is the main goal of parallel processing; for this reason computers such as multiprocessor servers, mainframes and supercomputers have a parallel architecture.

Parallel processing of information can be carried out at several levels (Fig. 5.1).

Obviously, the lower the level, the finer the fragmentation of software processes - the finer, as they say, the "grain of parallelism". In general, parallelism can be implemented at a single level or at several levels simultaneously. Independent single-processor processing implements parallelism at level 1. Vector processing consists of the parallel execution of loops at level 2 and can be performed on one or several processors. Levels 3 and 4 correspond to multiprocessor VS. Parallelism at level 5 is typical of multimachine computing systems.



There are two main ways to organize parallel processing:

Combination in time of the stages of solving different problems;

Simultaneous solution of various tasks or parts of one task;

The first way - combining in time the stages of solving different problems - is multiprogram information processing. Multiprogramming has long been widely used to improve the performance of computers and computing systems. A detailed discussion of multiprogramming belongs to the topic "Operating Systems" and is beyond the scope of this tutorial.

The second way - simultaneous solution of different tasks or parts of one task - is possible only if several processing devices are available. In this case, certain features of the tasks or task flows are exploited, which makes parallelization possible.

The following types of parallelism, which exploit the algorithmic features of individual tasks and task flows, can be distinguished.

1. Natural parallelism of independent tasks.

2. Parallelism of objects or data.

3. Parallelism of task or program branches.

Let's consider these types of parallelism.

1. Natural parallelism of independent tasks consists in the fact that the input of the computing system (CS) receives a continuous stream of unrelated tasks, i.e. the solution of any task does not depend on the results of solving the others. In this case, using several processing devices, for any way of combining them into a system, increases the system's performance.

A typical example of natural parallelism is the stream of user requests arriving at an informational web site: each request spawns a separate handling procedure that does not depend on other such procedures.

2. Parallelism of objects or data arises when the same (or almost the same) program must process a set of data items entering the system simultaneously.

These can be, for example, the tasks of processing signals from a radar station: all signals are processed according to the same program. Another example is the processing of information from sensors that simultaneously measure the same parameter and are installed on several objects of the same type.

Programs of this type can vary in size and complexity, from very simple ones containing a few operations to large programs with hundreds and thousands of operations. Parallel execution of operations is achieved by increasing the number of processing devices, each of which is capable of autonomously executing a sequence of instructions on a separate set of data. Often the main feature of such programs (in particular, programs for processing vectors and matrices) is that the same instruction must be applied to a large set of elementary data items that are interconnected in some way, and the corresponding operation can be performed on all of the data simultaneously. In this case, the time for solving the problem decreases in proportion to the number of processing devices.

3. Parallelism of task or program branches is one of the most common types of parallelism in information processing. It consists in the fact that when solving one task its separate parts - branches - can be distinguished, which, given several processing devices, can be executed in parallel. Only independent branches of a task can be executed in this way, i.e. parts of it for which the following conditions are met:

· none of the output values of these branches is an input value for another such branch (absence of functional connections);

· the conditions for executing one branch do not depend on the results or flags obtained during the execution of other branches (independence in control).

A good idea of branch parallelism is given by the tiered-parallel form (LPF) of a program, an example of which is shown in Fig. 5.2.

The program is represented as a set of branches located at several levels - tiers. Branches are shown as circles with numbers inside; the length of a branch is given by the number next to the circle, which tells how many time units the branch runs. The arrows show the input data and the processing results. Input data are denoted by the symbol X, output data by the symbol Y. The X symbols have subscripts indicating the numbers of the input quantities. The Y symbols have indices both below and above: the upper index is the number of the branch that produced the result, and the lower index is the ordinal number of the result produced by that branch. One tier contains independent branches of the task that are not related to each other, i.e. the results of any branch of a given tier are not input data for another branch of the same tier.

Fig. 5.2. An example of the tiered-parallel form of a program

The program shown in Fig. 5.2 contains 9 branches located on 3 tiers. Using this, on the whole, fairly simple program as an example, we can identify both the advantages of a computing system that includes several processing devices and the problems that arise in this case.

Let us assume that the length of the i-th branch is given by the number of time units t i required for its execution. Then it is easy to calculate the time T1 needed to execute the entire program on a single processor:

T1 = Σ t i = 10 + 20 + 15 + 30 + 55 + 10 + 15 + 25 + 15 = 195

If we imagine that the program is executed by two processing devices (processors) working independently of each other, the time for solving the problem will decrease. However, as is easy to see, this time will differ depending on the order in which the independent branches are executed.

Consider, for example, the following variant of executing the program shown in Fig. 5.2: let processor 1 execute branches 1-3-4-6-7-9 and processor 2 execute branches 2-5-8. Fig. 5.3 shows the timing diagrams of the execution of the program branches by the processors.

Fig. 5.3. Distribution of the program branches between the two processors

It is easy to calculate that processor 1 finishes its work after 105 units of time and processor 2 after 100. There are two periods during which one of the processors is forced to idle: P1, lasting 10 units, and P2, lasting 5 units of time. The gap P1, during which only processor 2 works, arises because branch 7 depends on branch 5 (by the time branch 6 is completed, the first result of branch 5 is not yet ready). The gap P2, during which only processor 1 works, arises because processor 2 has already finished its count.

Thus, on a system of two processors our program will be executed in no less than 105 units of time. The quantity that characterizes the reduction of the solution time on several processors compared to a single processor is called the speedup of the count S and is calculated as S = T1 / Tn, where Tn is the execution time on n processors.

The parallelization coefficient Kp = S / n varies from 0 to 1 (from 0 to 100%) and reflects the efficiency of using the computing resources. In our example it is easy to calculate that the speedup S = 195/105 = 1.86 and the parallelization coefficient Kp = 0.93. As we can see, because of the idle time of one of the processors the speedup is noticeably less than 2, i.e. less than the number of processors used. Note that our example did not take into account the delays associated with switching program contexts (changing branches) and transferring data from one branch to another. Nevertheless, due to the algorithmic features of the program, part of the calculation during the intervals P1 and P2 is performed by only one processor, i.e. essentially sequentially.

Let us consider the generalized case of a program in which the share of sequential computations f (the ratio of the time of sequential computations to the total computation time of the program) is determined algorithmically. In this case the execution time of the program on a system of p processors cannot be less than Tp = T1 * (f + (1 - f)/p), i.e. the speedup cannot exceed S = 1 / (f + (1 - f)/p).

This relation is called Amdahl's law. For the program in Fig. 5.2 the share of sequential computations is f = 15/195. Substituting this value into the formula of Amdahl's law gives a maximum speedup of 1.86 for a system of two processors, which matches the value calculated earlier.

To illustrate the operation of Amdahl's law, consider the following example. Let the share of sequential computations in some program be 10%. Then the maximum speedup on 100 processors will not exceed 9.2, and the parallelization coefficient will be only 9.2%. On 10 processors the speedup will be 5.3 and the parallelization coefficient 53%. It is easy to see that even such a small share of sequential computations, already at the theoretical level and without considering the inevitable delays in a real CS, seriously limits the scalability of a program.
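
A small, purely illustrative C calculation that reproduces these numbers from the Amdahl formula:

#include <stdio.h>

/* Maximum speedup by Amdahl's law for a sequential share f on p processors. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.10;                 /* 10% of the computation is sequential */
    int procs[] = {10, 100};

    for (int i = 0; i < 2; i++) {
        int p = procs[i];
        double s = amdahl_speedup(f, p);
        printf("p = %3d: speedup <= %.1f, parallelization coefficient = %.1f%%\n",
               p, s, 100.0 * s / p);
    }
    return 0;                        /* prints about 5.3 (53%) and 9.2 (9.2%) */
}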

Let us determine the maximum share f of sequential computations in a program that still makes it possible to obtain a predetermined speedup S with the maximum parallelization coefficient Kp. To do this, we express the share of sequential computations from Amdahl's law:

f = (1/S - 1/p) / (1 - 1/p). (5.6)

Relation (5.6) yields a very important consequence of Amdahl's law: in order to speed up a program by a factor of q, it is necessary to speed up, by no less than a factor of q, no less than a (1 - 1/q)-th part of the program. For example, to obtain a 100-fold speedup, at least 99% of the entire program must be parallelized.

In addition to algorithmic parallelization, solving a problem with parallel branches on several processing devices requires an appropriate organization of the process, which determines the way the problem is solved and produces the necessary information about the readiness of each branch. All this is relatively easy to implement when the execution time of each branch is known fairly accurately; in practice this is extremely rare - at best there is some estimate of it. Therefore, organizing an optimal or near-optimal work schedule is a rather difficult task.

It should also be noted that identifying independent branches causes certain difficulties during program development. At the same time, for many complex problems only programming with the identification of independent branches can significantly reduce the solution time. In particular, problems of matrix algebra, linear programming, spectral signal processing, and direct and inverse Fourier transforms lend themselves well to this kind of parallel processing.






