Supermachines: the history of the development of supercomputers. Supercomputers in Russia


Supercomputers: past, present and future

The term “supercomputer” was first used in the early 60s, when a group of specialists at the University of Illinois (USA), led by Dr. D. Slotnik, proposed the idea of implementing the world's first parallel computing system. The project, called SOLOMON, was based on the principle of vector processing formulated by J. von Neumann and on the concept of a matrix parallel architecture proposed by S. Unger in the early 50s.

The fact is that most supercomputers owe their astonishing performance to precisely this (vector) type of parallelism. Any programmer who develops programs in familiar high-level languages has probably come across so-called DO loops more than once, but few have thought about the performance potential hidden in these commonplace constructs. D. Knuth, a well-known specialist in programming systems, showed that DO loops occupy less than 4% of the code of FORTRAN programs but account for more than half of a task's computing time.

The idea of vector processing of such loops is to add to the computer's instruction set a vector operation that works on all elements of the operand vectors at once. This realizes two opportunities for speeding up the computation: first, the number of object-code instructions the processor executes is reduced, since there is no need to recompute indices and organize conditional branches; second, all the additions of the elements of the operand vectors can be performed simultaneously thanks to parallel processing.
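
As a simple illustration of the kind of loop being discussed, the sketch below (written in C rather than FORTRAN, and not taken from the original article) shows the element-wise addition of two operand vectors; on a vector machine the whole loop body becomes a single vector instruction instead of N scalar iterations.

    #include <stdio.h>

    #define N 64   /* a typical hardware vector length */

    int main(void) {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* In scalar form the processor recomputes the index and tests the
           loop condition on every iteration; a vector computer replaces
           the whole loop with one vector addition over all N elements. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %.1f, c[%d] = %.1f\n", c[0], N - 1, c[N - 1]);
        return 0;
    }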

It is important to note another feature of vector processing related to the number of elementary loop operations: the more parallel operations are included in the vectorized loop, the greater the gain in computation speed, since the share of unproductive time spent on fetching, decoding and launching the execution of a vector command is reduced.

The first supercomputer to take advantage of vector processing was ILLIAC IV (SIMD architecture). In the early 60s, Slotnik's group, organized as the Center for Advanced Computing Technologies at the University of Illinois, began the practical implementation of a vector supercomputer with a matrix structure; production of the machine was undertaken by Burroughs Corp. The technical side of the project is still striking in its scale: the system was to consist of four quadrants, each comprising 64 processing elements (PEs) and 64 memory modules connected by a switch based on a hypercube network. All the PEs of a quadrant process the vector instruction sent to them by the command processor, each performing one elementary vector operation on data stored in the memory module associated with that PE. Thus, one quadrant of ILLIAC IV can process 64 vector elements simultaneously, and the full four-quadrant system 256 elements. In 1972, the first ILLIAC IV system was installed at NASA Ames Research Center, and the results of its operation there received mixed assessments. On the one hand, the supercomputer made it possible to solve a number of complex aerodynamic problems that other computers could not cope with: even the fastest scientific computer of the time, the Control Data CDC 7600, designed, incidentally, by the "patriarch of supercomputers" Seymour Cray, could deliver no more than 5 MFLOPS, while ILLIAC IV demonstrated an average performance of approximately 20 MFLOPS. On the other hand, ILLIAC IV was never brought up to its full configuration of 256 PEs; in practice the developers limited themselves to a single quadrant. The reasons were not so much technical difficulties in increasing the number of processing elements as problems with programming the data exchange between them through the memory-module switch. All attempts to solve this problem in system software failed, so each application required manual programming of the switch transfers, which led to unsatisfactory user reviews.

If the developers of ILLIAC IV had managed to overcome the problems of programming the matrix of processing elements, the development of computer technology would probably have taken a completely different path, and today computers with a matrix architecture would dominate. However, neither in the 60s nor later was a satisfactory and universal solution found to two fundamental problems: programming the parallel operation of several hundred processors while keeping to a minimum the time spent on data exchange between them. It took roughly another 15 years of effort by various companies building supercomputers with matrix architectures to reach the final diagnosis: computers of this type cannot satisfy a wide range of users and have a very limited scope, often confined to one or a few types of tasks.

As ultra-high-speed data processing matured, the gap between progress in program vectorization, i.e. the automatic conversion of sequential language constructs into vector form during compilation, and the extreme complexity of programming the switching and distribution of data between processing elements provoked a rather harsh reaction from users of matrix supercomputers: the broad community of programmers demanded a simpler and more "transparent" vector-processing architecture that could be used with standard high-level languages such as FORTRAN. The solution was found in the late 60s, when Control Data, with which Cray was then collaborating, introduced the STAR-100, a machine based on the vector-pipeline principle of data processing. The difference between vector-pipeline technology and the matrix architecture is that instead of many processing elements executing the same instruction on different elements of a vector, a single pipeline of operations is used, whose operating principle fully corresponds to the classic assembly line of Ford's automobile plants. Even a supercomputer as archaic by modern standards as the STAR-100 showed a peak performance of 50 MFLOPS. It is also telling that vector-pipeline supercomputers are much cheaper than their matrix "relatives". For example, the development and production of ILLIAC IV cost $40 million, with operating costs of about $2 million per year, while the market price of the first supercomputers from Cray and Control Data was in the range of $10-15 million, depending on the amount of memory, the set of peripheral devices and other features of the system configuration.

The second significant feature of the vector-pipeline architecture is that the operations pipeline has only one input, through which the operands arrive, and one output for the result, whereas in matrix systems there are many data inputs to the processing elements and many outputs from them. In other words, in computers with pipeline processing the data of all concurrently executed operations is read from and written to a single memory, so there is no need for the processing-element switch that had become the stumbling block in the design of matrix supercomputers.

The next blow to matrix-architecture supercomputers was dealt by two machines from Control Data Corp., the CYBER-203 and CYBER-205: the peak performance of the first was 100 MFLOPS, and of the second already 400 MFLOPS.

CRAY-1 makes a revolution

The STAR-100 vector-pipeline supercomputer and the CYBER-200 series machines were, figuratively speaking, only a knockdown for the matrix architecture. The knockout blow came in the mid-70s, when Cray, who had by then left CDC and founded his own company, Cray Research, announced the CRAY-1, a vector-pipeline supercomputer that became an epoch-making event in the world of computing. This small machine (slightly taller than an average person, with the processor occupying a little over 2.5 sq. m) had a performance of 160 MFLOPS and a RAM capacity of 64 MB. After a short trial period at the Los Alamos laboratory, where the new machine received the highest marks from programmers and mathematicians, Cray Research launched serial production of the CRAY-1, which sold briskly in the United States. Curiously, the US administration duly appreciated the strategic value of the CRAY-1 and controlled its delivery even to friendly states. The appearance of the CRAY-1 aroused interest not only among users in need of ultra-high-speed data processing, but also among specialists in supercomputer architecture. What was unexpected for many (and even unpleasant for the developers of the CYBER-205) was that on most tasks the small CRAY-1 ran faster than the CYBER-205, which was significantly larger and had a higher peak performance. Thus, when testing on the LINPACK linear-equation package, Jack Dongarra of Argonne National Laboratory estimated the performance of the CRAY-1S in the range of 12-23 MFLOPS, depending on the programming method, while the CYBER-205 showed only 8.4 MFLOPS. The explanation was found as soon as G. Amdahl's law was recalled: the famous architect of the IBM/360 formulated it in 1967 as the postulate that "the performance of a computing system is determined by its slowest component." Applied to vector supercomputers, Amdahl's paradox works as follows. Any task run on a supercomputer consists of two interrelated parts: vector instructions generated by the compiler when vectorizing the source program, and scalar operations that the compiler was unable to translate into vector form. If one imagines a supercomputer that performs scalar and vector operations equally quickly, Amdahl's paradox "does not apply" and such a system would execute tasks of any degree of vectorization with equal speed. In reality, scalar processing is the slower part, and the CRAY-1, with its 12.5 ns cycle, had a higher scalar processing speed than the CYBER-205 with its 20 ns cycle.
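
A small worked example helps to see how Amdahl's observation limits the benefit of vectorization. The figures below are illustrative assumptions, not measurements from the article: if 80% of the work is vectorized and the vector unit is ten times faster than the scalar unit, the overall speedup is well under 4x, and even an infinitely fast vector unit could not push it past 5x.

    #include <stdio.h>

    int main(void) {
        double vector_fraction = 0.80;  /* share of work the compiler vectorized (assumed) */
        double vector_speedup  = 10.0;  /* vector unit vs. scalar unit (assumed) */

        double scalar_fraction = 1.0 - vector_fraction;
        double speedup = 1.0 / (scalar_fraction + vector_fraction / vector_speedup);

        printf("overall speedup: %.2fx (limit %.1fx even with an infinitely fast vector unit)\n",
               speedup, 1.0 / scalar_fraction);
        return 0;
    }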

SUPERCOMPUTERS: History and modernity.

1. Introduction

I study in an information and mathematics class, where one of the main areas of emphasis is computer science. We study computers and their structure, the basics of programming, and the areas in which computers are applied. Every schoolchild today is familiar with a computer and uses a cell phone whose operating system is being improved day by day. I became interested in the question: is there a limit to this improvement? Which machine is the fastest today, and what prospects await us in this field?

2. A little history

The world's first computer appeared in 1943, during World War II. Its British inventors called it Colossus, and it was intended for breaking encrypted German military communications. Colossus consisted of 2,000 vacuum tubes and worked at a fantastic speed, processing about 25,000 characters per second.

Time passed, progress did not stand still, and people invented newer, faster, less cumbersome computers. Today the personal computer, which almost everyone has at home, can do far more than its predecessors.

Since the advent of the first computers, one of the main problems facing developers has been the performance of the computing system. As the computer industry developed, processor performance grew rapidly, but increasingly sophisticated software, a growing number of users and an expanding range of applications kept placing new demands on the power of the hardware, and this led to the emergence of supercomputers.

3. Supercomputer, flops and Moore's law

What are supercomputers? A supercomputer (from the English supercomputer) is a computing machine that significantly surpasses most existing computers in its technical parameters and allows complex calculations to be performed in a shorter time. That is in fact what the prefix "super" means (from the English "above, beyond"). Any computer system consists of three main components: a central processor, that is, the computing device, a memory unit, and a secondary storage system. Of key importance are not only the technical parameters of each of these elements but also the throughput of the channels connecting them with each other and with user terminals.

An important indicator of a computer is its speed. It is measured in so-called flops, from the English abbreviation FLOPS (FLoating-point Operations Per Second), a non-system unit of computer performance showing how many floating-point operations per second a given computing system performs. In other words, it measures how many of the most demanding calculations the machine can perform in one second.
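
For example, the theoretical peak of a machine is usually obtained by simple multiplication: the number of processors (or cores) times the clock frequency times the number of floating-point operations each can complete per cycle. The sketch below uses made-up numbers purely to show the arithmetic.

    #include <stdio.h>

    int main(void) {
        double cores = 4;              /* number of processor cores (assumed) */
        double clock_hz = 2.0e9;       /* clock frequency, 2 GHz (assumed) */
        double flops_per_cycle = 8;    /* floating-point operations per cycle (assumed) */

        double peak = cores * clock_hz * flops_per_cycle;
        printf("theoretical peak: %.1f GFLOPS\n", peak / 1.0e9);
        return 0;
    }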

The beginning of the supercomputer era can perhaps be dated to 1976, when the first vector system, the Cray 1, appeared. Working with what was then a limited set of applications, the Cray 1 showed results so impressive in comparison with conventional systems that it deservedly received the name "supercomputer" and determined the development of the entire high-performance computing industry for many years.

Cray-1 was the fastest at that time. The Cray-1 memory was 8 MB, divided into 16 blocks, with a total access time of 12.5 ns. There was also external memory on magnetic disks with a capacity of about 450 MB, expandable to 8 GB. An optimizing Fortran translator, a macro assembler, and a special multitasking OS were created for the machine.

Over the past 15 years, supercomputer performance standards have changed several times. According to the 1986 definition in the Oxford dictionary of computing, to earn the proud title of "supercomputer" a machine needed a performance of 10 megaflops (millions of floating-point operations per second). In the early 90s the 200-megaflop mark was passed, then 1 gigaflop.

All computers on Earth obey Moore's law: their performance doubles roughly every year and a half. In 1965, Gordon Moore, one of the founders of Intel, noticed the following pattern: new chip models appeared about a year after their predecessors, and the number of transistors in them roughly doubled each time. Moore concluded that if this trend continued, the power of computing devices would grow exponentially over a relatively short period. This observation is called Moore's law, and the entire development of the electronics industry over the past 45 years has only confirmed it.
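
The doubling rule is easy to turn into numbers. The short calculation below (illustrative only) shows the growth factor implied by doubling every year and a half; compile with the math library, e.g. cc moore.c -lm.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double doubling_period = 1.5;                 /* years */
        for (int years = 3; years <= 15; years += 3) {
            double factor = pow(2.0, years / doubling_period);
            printf("after %2d years: about x%.0f\n", years, factor);
        }
        return 0;
    }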

4. Use of supercomputers

Why do we need supercomputers at all? The expansion of human knowledge has always rested on two cornerstones that cannot exist without each other: theory and experiment. However, scientists now find that many experiments have become practically impossible, in some cases because of their scale, in others because of their cost or the danger to human health and life. This is where powerful computers come to the rescue: by allowing experimentation with electronic models of reality, they become the "third pillar" of modern science and production.

The traditional scope of application of supercomputers has always been scientific research: plasma physics and statistical mechanics, condensed matter physics, molecular and atomic physics, the theory of elementary particles, gas dynamics and turbulence theory, astrophysics. In chemistry, various areas of computational chemistry: quantum chemistry (including calculations of electronic structure for the design of new materials, such as catalysts and superconductors), molecular dynamics, chemical kinetics, the theory of surface phenomena and solid-state chemistry, drug design. Naturally, a number of application areas lie at the interfaces of the relevant sciences, such as chemistry and biology, and overlap with technical applications. Thus, the problems of meteorology, the study of atmospheric phenomena and, above all, long-term weather forecasting, for which the power of modern supercomputers is never sufficient, are closely related to several of the physics problems listed above. Among the technical problems for which supercomputers are used are those of the aerospace and automotive industries (for example, design work and crash-test modeling), nuclear energy, the prediction and development of mineral deposits, the oil and gas industry (including the problems of efficient exploitation of deposits, especially three-dimensional studies of them), and, finally, the design of new microprocessors and computers, primarily supercomputers themselves.

Supercomputers are also used for military purposes. In addition to the obvious tasks of developing weapons of mass destruction and designing aircraft and missiles, one can mention, for example, the design of silent submarines. The most famous example is the American SDI program. An MPP computer of the US Department of Energy will be used to simulate nuclear weapons, which will make it possible to cancel nuclear testing in that country altogether.

5. Modern supercomputer standards

What are the current standards for supercomputers?

At the recent international conference SC11 in Seattle, the 10-petaflops mark, i.e. 10 quadrillion computational operations per second, was surpassed. This performance was demonstrated by the K Computer of the Japanese corporation Fujitsu, which contains 705,024 processor cores.

In third place is the most powerful US supercomputer, with a performance of 1.8 petaflops, which has not been upgraded for almost three years.

In 2005, the United States set itself the goal of breaking the 1-petaflops barrier, and the task was accomplished by 2008. But with the onset of the crisis, the development of American supercomputers slowed sharply.

A world record for data transfer speed was also set at the SC11 conference. Information was transmitted from Seattle to Victoria, Canada, a distance of 212 km, over a direct fiber-optic link at 98 gigabits per second. At this speed, downloading a movie would take less than a second. Such high data rates are needed for the international collaborative processing of the data produced by CERN at the Large Hadron Collider.

6. Supercomputer market in Russia.

Russia is becoming increasingly involved in the rapidly growing global market for high-performance computing (HPC). In 2003, Paradigm, a leading supplier of geoscience data-processing and drilling-design technologies for the oil and gas industry, upgraded its seismic processing center in Moscow by installing an IBM cluster of 34 dual-processor servers based on Intel Xeon. The new system accelerated Paradigm's resource-intensive computing applications through the use of Linux-based cluster technologies. The new possibilities for more accurate calculations will undoubtedly increase the competitiveness of Russian oil companies on the global market.

The two most important projects of 2005 were the installation of the domestically developed MVS-15000BM supercomputer at the Interdepartmental Supercomputer Center of the Russian Academy of Sciences (MSC) and the installation at NPO Saturn of an IBM eServer Cluster 1350 comprising 64 dual-processor IBM eServer xSeries 336 servers. The latter is the largest supercomputer in Russia used in industry and the fourth in the overall ranking of supercomputers in the CIS. NPO Saturn intends to use it in designing gas-turbine engines for civil aircraft. Work is also under way on specialized engineering software for modeling various high-energy processes in the chemical, nuclear and aerospace industries. The IP-3D package, for example, is designed for numerical modeling of gas-dynamic processes at extremely high temperatures and pressures that cannot be reproduced in the laboratory.

Other major domestic supercomputer projects are the Russian MVS project and the Russian-Belarusian SKIF. Development of the MVS supercomputer was financed by the Ministry of Industry and Science of Russia, the Russian Academy of Sciences, the Ministry of Education of Russia, the Russian Foundation for Basic Research, and the Russian Foundation for Technological Development. Machines of this series are currently installed at the MSC RAS and at a number of regional scientific centers of the RAS (Kazan, Ekaterinburg, Novosibirsk) and are used primarily for scientific calculations. One of the software developers for MVS is the company InterProgma, which operates in Chernogolovka within an existing IT park; in close cooperation with the Institute of Chemical Physics of the Russian Academy of Sciences, it is developing basic software for large-scale simulation on supercomputer systems.

In Russia, SKIF and MVS are still perceived only as academic projects. The reason is that large Russian engineering corporations such as NPO Saturn prefer foreign supercomputers: proven application solutions from world leaders such as IBM and HP come with ready-made target software and development tools and offer better service. Creating a shared computing center oriented toward the industrial sector, with distributed access to computer time, would help make MVS and SKIF attractive to Russian industry. Such a center would dramatically reduce the cost of maintaining a supercomputer and would also speed up the creation and systematization of software (drivers, libraries, standard applications).

7. Conclusion

Just 10-15 years ago, supercomputers were something of an elite piece of equipment, available mainly to scientists from secret nuclear centers. However, the development of ultra-high performance hardware and software has made it possible to master the industrial production of these machines, and the number of their users currently reaches tens of thousands. In fact, these days the whole world is experiencing a genuine boom in supercomputer projects, the results of which are actively used not only by such traditional consumers of high technology as the aerospace, automotive, shipbuilding and radio-electronic industries, but also by the most important areas of modern scientific knowledge.

17.06.1995 Dmitry Volkov, Mikhail Kuzminsky


The dialectical spiral of computer technology development has made its next round - again, like ten years ago, in accordance with the requirements of life, supercomputer architectures are coming into fashion. Of course, these are no longer the monsters that veterans remember - new technologies and the demanding market for commercial applications have significantly changed the appearance of the modern supercomputer. Now these are not huge cabinets with unique equipment around which computer science shamans conjure, but quite ergonomic systems with unified software, compatible with their younger brothers. The article provides an overview of the current state and possible prospects for the development of supercomputers. The main areas of their application are considered and the features of various types of architectures characteristic of modern supercomputers are analyzed.

What is a supercomputer? The Oxford Explanatory Dictionary of Computing, published almost 10 years ago, in 1986, reports that a supercomputer is a very powerful computer with a performance of over 10 MFLOPS (millions of floating point operations per second). Today, this result is surpassed not only by workstations, but even, at least in terms of peak performance, by PCs. In the early 90s, the limit was already drawn around the 300 MFLOPS mark. This year, judging by press reports, specialists from two leading “supercomputer” countries - the USA and Japan - agreed to raise the bar to 5 GFLOPS.

However, this approach to defining a supercomputer is not entirely correct. Obviously, any sane person would call the modern dual-processor Cray C90 a supercomputer, yet its peak performance is less than 2 GFLOPS. Closely related to this issue are the restrictions (formerly imposed by COCOM, now by the US State Department) on supplying high-performance computing equipment to other countries: computers with a performance of over 10,000 million theoretical operations per second (MTOPS) are considered supercomputers by the State Department's definition.

It would be more correct, in our opinion, to list the main features that characterize a supercomputer, among which, in addition to high performance, it should be noted:

  • the most modern technological level (for example, GaAs technology);
  • specific architectural solutions aimed at increasing performance (for example, the presence of operations on vectors);
  • price, usually over 1-2 million dollars.

In the USENET supercomputing newsgroup, in connection with the rapid progress of RISC microprocessor technology and the corresponding growth in performance, the question was once asked: when will a workstation turn into a supercomputer? The answer was: "When it costs over 1 million dollars." For illustration, the Cray-1 at one time cost about 8 million dollars, while the Cray T90 supercomputers announced this year, which have far higher performance, cost from 2.5 to 35 million dollars. The cost of building the supercomputer MPP system in the Sandia Laboratory project of the US Department of Energy is about $46 million.

At the same time, there are computers that have all the above characteristics of a supercomputer except the price, which for them ranges from several hundred thousand to 2 million dollars. These are minisupercomputers: machines with high performance that nevertheless falls short of large supercomputers. At the same time, minisupercomputers as a rule have a noticeably better price/performance ratio and significantly lower operating costs: cooling, power supply, floor-space requirements, and so on. These computers are aimed at smaller computing centers, at the level of a department rather than an entire university or corporation. Examples of such machines are the Cray J90, the Convex C38XX and, possibly, the C4/XA. They also include modern supercomputer systems based on RISC microprocessors, for example, the IBM SP2, SGI POWER CHALLENGE, DEC AlphaServer 8200/8400, etc.

From an architectural point of view, minisupercomputers do not represent a special direction, so they are not considered separately in the following.

Areas of application of supercomputers

What applications require such expensive equipment? It may seem that with the increase in productivity of desktop PCs and workstations, as well as servers, the very need for supercomputers will decrease. This is wrong. On the one hand, a number of applications can now be successfully run on workstations, but on the other hand, time has shown that a steady trend is the emergence of more and more new applications for which it is necessary to use a supercomputer.

First of all, we should point to the penetration of supercomputers into the commercial sphere, which was previously completely inaccessible to them. We are talking not only about, say, graphics applications for cinema and television, where equally high floating-point performance is required, but above all about tasks involving intensive (including online) transaction processing for ultra-large databases. This class of tasks also includes decision-support systems and data warehouses. Of course, one can argue that such applications primarily need high I/O performance and fast integer operations, and that the computer systems best suited to them, for example, Tandem's Himalaya MPP systems, SGI's SMP CHALLENGE computers or DEC's AlphaServer 8400, are not exactly supercomputers. But it should be remembered that the same requirements arise, in particular, in a number of applications of nuclear physics, for example when processing the results of experiments on particle accelerators; and nuclear physics has been a classic area of supercomputer application since their inception.

Be that as it may, there has been a clear tendency towards convergence of the concepts of “mainframe”, “multiprocessor server” and “supercomputer”. It is worth noting that this is happening against the background of a massive transition to centralization and consolidation that has begun in many areas, as opposed to the process of disaggregation and decentralization.

The traditional scope of application of supercomputers has always been scientific research: plasma physics and statistical mechanics, condensed matter physics, molecular and atomic physics, theory of elementary particles, gas dynamics and turbulence theory, astrophysics. In chemistry - various areas of computational chemistry: quantum chemistry (including calculations of electronic structure for the purpose of designing new materials, such as catalysts and superconductors), molecular dynamics, chemical kinetics, theory of surface phenomena and solid state chemistry, drug design. Naturally, a number of application areas lie at the interfaces of relevant sciences, such as chemistry and biology, and overlap with technical applications. Thus, the problems of meteorology, the study of atmospheric phenomena and, first of all, the problem of long-term weather forecasting, for the solution of which the power of modern supercomputers is constantly insufficient, are closely related to the solution of a number of the problems of physics listed above. Among the technical problems for which supercomputers are used, we point out the problems of the aerospace and automotive industries, nuclear energy, prediction and development of mineral deposits, the oil and gas industry (including problems of efficient exploitation of deposits, especially three-dimensional problems of their study), and, finally, the design of new microprocessors and computers, primarily the supercomputers themselves.

Supercomputers are traditionally used for military purposes. In addition to the obvious tasks of developing weapons of mass destruction and designing aircraft and missiles, we can mention, for example, the design of silent submarines, etc. The most famous example is the American SDI program. The already mentioned MPP computer of the US Department of Energy will be used to simulate nuclear weapons, which will make it possible to completely cancel nuclear tests in this country.

Analyzing the potential needs for supercomputers of applications that exist today, we can roughly divide them into two classes. The first includes applications in which it is known what level of performance must be achieved in each specific case, for example, long-term weather forecasting. The second category includes tasks that are characterized by a rapid increase in computational costs with increasing size of the object under study. For example, in quantum chemistry, ab initio calculations of the electronic structure of molecules require computational resources proportional to N^4 or N^5, where N conventionally characterizes the size of the molecule. Nowadays, many molecular systems are forced to be studied in a simplified model representation. With even larger molecular entities in reserve (biological systems, clusters, etc.), quantum chemistry provides an example of an application that is a “potentially infinite” user of supercomputing resources.
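
The practical consequence of such scaling is easy to see with a small calculation (the N^4 exponent is taken from the text above; the reference point is arbitrary): each doubling of the molecule size multiplies the cost by sixteen.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double base_n = 10.0, base_cost = 1.0;   /* arbitrary reference point */
        for (int k = 0; k <= 3; k++) {
            double n = base_n * pow(2.0, k);
            double cost = base_cost * pow(n / base_n, 4.0);   /* cost grows as N^4 */
            printf("N = %4.0f  ->  relative cost %6.0f\n", n, cost);
        }
        return 0;
    }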

There is one more problem with the use of supercomputers that needs to be mentioned - this is the visualization of data obtained as a result of calculations. Often, for example, when solving differential equations using the grid method, one has to deal with gigantic volumes of results that a person is simply not able to process in numerical form. Here, in many cases, it is necessary to turn to a graphical form of presenting information. In any case, the problem arises of transporting information over a computer network. Solving this set of problems has recently received increasing attention. In particular, the famous US National Center for Supercomputing Applications (NCSA) together with Silicon Graphics is working on the “supercomputing environment of the future” program. This project will integrate the capabilities of SGI's POWER CHALLENGE supercomputers and visualization tools with the information superhighway.

Supercomputers in Russia

Supercomputers are a national treasure, and their development and production should undoubtedly be one of the priorities of the state technical policy of countries that are world leaders in science and technology. A brilliant example of a deep understanding of the whole complex of related problems is an article by the famous Nobel laureate in physics K. Wilson; published over ten years ago, it is still of interest to the Russian reader.

Almost the only countries developing and producing supercomputers on a large scale are the USA and Japan. Their own supercomputers were created in India and China. Most developed countries, including a number of Eastern European countries, prefer to use supercomputers made in the USA and Japan.

The situation with the development of supercomputers in Russia today clearly leaves much to be desired. Work on domestic supercomputers has been carried out in several organizations in recent years. Under the direction of Academician V.A. Melnikov, the vector supercomputer "Elektronika SS-100" was developed, with an architecture reminiscent of the Cray-1. At ITMiVT RAS, work is under way on the Elbrus-3 supercomputer, which can have up to 16 processors with a clock cycle of 10 ns. According to the developers' estimates, on LINPACK tests the processor speed will be 200 MFLOPS at N = 100 and 370 MFLOPS at N = 1000. Another development of the same institute, the Modular Pipeline Processor, uses an original vector architecture but will probably be inferior to Elbrus-3 in speed.

Another center of work on domestic supercomputers is NICEVT, famous for its work on the ES series computers. A number of interesting developments have been carried out there, including various models of the ES-1191 vector supercomputer based on ECL technology, and work is under way on a new supercomputer, AMUR, which uses CMOS technology. A group of organizations led by the Institute of Applied Mathematics of the Russian Academy of Sciences is working on the MVS-100 MPP computer, whose processing elements use i860XP microprocessors, with T805 transputers for communications. Although prototypes of some of these domestic computers exist, none is in commercial production.

The situation with the provision of supercomputers to Russian organizations is perhaps even worse. We will limit ourselves to information about the state of affairs and prospects for the future in research institutes and universities, which, as mentioned above, are among the main potential users of supercomputers.

Most of the supercomputer installations probably use Convex products. Several organizations operate older models of the C1xx and C2xx series of minisupercomputers, which are already inferior in performance to modern workstations. In St. Petersburg a Convex C3800 series minisupercomputer was installed at the State Committee for Higher Education; in Moscow an SPP 1000/CD supercomputer system was recently installed at the Institute of Applied Mathematics of the Russian Academy of Sciences. There are plans to install other supercomputers (for example, SGI POWER CHALLENGE) in a number of RAS institutes.

Meanwhile, the lack of opportunities to use supercomputers hinders the development of domestic science and makes the successful development of entire areas of scientific research fundamentally impossible. Purchasing one or two, even very powerful, supercomputers will not help solve this problem. And it’s not just about the cost of purchasing them and the costs of maintaining functionality (including power supply and cooling). There are a number of other reasons (for example, information delivery over a computer network) that prevent the effective use of supercomputers.

A more appropriate approach seems to be the one proposed by the Russian Foundation for Basic Research. The "Program for creating integrated communication networks and databases of fundamental science and education" drawn up for 1995-1998 provides for the organization of a number of regional and subject-oriented supercomputer centers. Such centers could, for example, install relatively cheap minisupercomputers with a better cost/performance ratio. Indeed, one need only look at the TOP500 list to see a clear trend toward replacing large (and expensive) supercomputers with relatively inexpensive systems that are already capable of handling the lion's share of potential tasks.

As for domestic supercomputers, without the necessary government support for projects for their development, we cannot count on the creation of industrial designs in the next 1-2 years, and it is unlikely that such computers will be able to form the basis of the supercomputer fleet in the domestic supercomputer centers being created today.

Architecture of modern supercomputers

In this review, it makes no sense to dwell on the details of the classification of supercomputer architecture; we will limit ourselves only to a consideration of typical supercomputer architectures that are widespread today, and we will present Flynn’s classic taxonomy.

In accordance with it, all computers are divided into four classes depending on the number of command and data streams. The first class (von Neumann sequential computers) includes ordinary scalar single-processor systems: single command stream - single data stream (SISD). The personal computer has a SISD architecture, and it does not matter whether the PC uses pipelines to speed up operations.

The second class is characterized by a single command stream but multiple data streams (SIMD). Single-processor vector or, more precisely, vector-pipeline supercomputers, such as the Cray-1, belong to this architectural class. In this case we are dealing with one stream of (vector) commands but many data streams: each element of a vector belongs to a separate data stream. Matrix processors, for example the once-famous ILLIAC-IV, belong to the same class of computer systems. They also have vector instructions and implement vector processing, but not by means of pipelines, as in vector supercomputers, but by means of processor matrices.

The third class, MIMD, includes systems that have multiple command streams and multiple data streams. It covers not only multiprocessor vector supercomputers but multiprocessor computers in general. The vast majority of modern supercomputers have a MIMD architecture.

The fourth class in Flynn's taxonomy, MISD, is of no practical interest, at least for the computers we analyze. Recently, the term SPMD (single program multiple data) is also often used in the literature. It does not refer to computer architecture, but to a model of program parallelization, and is not an extension of Flynn's taxonomy. SPMD usually refers to MPP (i.e. MIMD) systems and means that several copies of the same program are executed in parallel on different processor nodes with different data.

It is also interesting to mention a fundamentally different direction in the development of computer architectures - data flow machines. In the mid-80s, many researchers believed that the future of high-performance computers was connected precisely with computers controlled by data flows, in contrast to all the classes of computer systems controlled by command flows that we considered. In data flow machines, many instructions can be executed simultaneously, for which operands are ready. Although computers with such an architecture are not commercially produced today, some elements of this approach are reflected in modern superscalar microprocessors, which have many parallel functional devices and a command buffer waiting for operands to be ready. Examples of such microprocessors include HP PA-8000 and Intel Pentium Pro.

In accordance with Flynn's classification, consideration of supercomputer architecture should begin with the SISD class. However, all vector-pipeline (hereinafter simply vector) supercomputers have an architecture “no less than” SIMD. As for supercomputer servers using modern high-performance microprocessors, such as the R8000-based SGI POWER CHALLENGE or the Alpha 21164-based DEC AlphaServer 8200/8400, their minimum configurations are single-processor. However, if we do not consider the actual architecture of these microprocessors, then all the features of the architecture of the servers themselves should be analyzed in a “natural” multiprocessor configuration. Therefore, we will begin the analysis of supercomputer architectures immediately with the SIMD class.

Vector supercomputers

Among modern supercomputers, it is the single-processor vector machines that have this (SIMD) architecture. Almost all of them are also available in multiprocessor configurations, which belong to the MIMD class. However, many features of the architecture of vector supercomputers can be understood by considering even single-processor systems.

The typical layout of a single-processor vector supercomputer can be illustrated by the FACOM VP-200 of the Japanese company Fujitsu; other vector supercomputers, for example those of Cray Research and Convex, have a similar architecture. Common to all vector supercomputers is the presence of vector-operation instructions, for example vector addition, which work with vectors of a certain length, say 64 elements of 8 bytes each. In such computers the vector operations are usually performed on vector registers, although this is not strictly necessary. The presence of mask registers makes it possible to execute a vector command not on all elements of a vector but only on those indicated by the mask.

Of course, specific implementations of vector architecture in various supercomputers have their own modifications of this general scheme. For example, in computer systems of the VP series from Fujitsu, support for the ability to reconfigure the vector register file is implemented in hardware - you can, for example, increase the length of vector registers while simultaneously proportionally reducing their number.

Since the Cray-1, many vector supercomputers, including Fujitsu's VP series and Hitachi's S series, have included an important feature for speeding up vector calculations called instruction chaining. Consider, for example, the following sequence of instructions that operate on vector V-registers on Cray computers:

V2 = V0 * V1
V4 = V2 + V3

It is clear that the second instruction cannot begin executing immediately after the first: for that, the first instruction must fill register V2, which takes a certain number of clock cycles. The chaining facility, however, allows the second instruction to begin executing without waiting for the first one to complete: as each result appears in register V2, a copy of it is sent to the addition functional unit, and the second command is launched. Naturally, the details of which vector commands can be chained differ from computer to computer.
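
In high-level code the chained pair of vector instructions corresponds to a multiply followed by a dependent add over whole vectors, as in the C sketch below (an illustration written for this text, not code from the article); on a chained vector machine the addition starts consuming each product as soon as it appears, instead of waiting for the entire multiply to finish.

    #include <stdio.h>

    #define N 64

    int main(void) {
        double v0[N], v1[N], v2[N], v3[N], v4[N];

        for (int i = 0; i < N; i++) { v0[i] = i; v1[i] = 2.0; v3[i] = 1.0; }

        for (int i = 0; i < N; i++) {
            v2[i] = v0[i] * v1[i];   /* first vector instruction */
            v4[i] = v2[i] + v3[i];   /* second one is chained: it picks up each v2[i] as it is produced */
        }

        printf("v4[0] = %.1f, v4[%d] = %.1f\n", v4[0], N - 1, v4[N - 1]);
        return 0;
    }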

As for scalar processing, the corresponding instruction subsystem in the Japanese Fujitsu and Hitachi supercomputers is compatible with the IBM/370, which has obvious advantages; in this case a traditional cache is used to buffer scalar data. Cray Research, by contrast, abandoned cache memory starting with the Cray-1, using instead special software-addressable B and T buffer registers; only in the latest series, the Cray T90, was an intermediate cache for scalar operations introduced. Note that on the path between RAM and the vector registers there is no intermediate buffer memory, which means the RAM subsystem must have high throughput: to sustain a high rate of computation, data must be loaded into the vector registers and results written back to memory quickly.

So far we have considered vector computers in which the operands of the corresponding instructions reside in vector registers. Besides the Fujitsu and Hitachi machines already mentioned, vector registers are used in the SX series of another Japanese company, NEC, including its most powerful SX-4 computers, as well as in all vector computers from Cray Research, including the C90, M90 and T90, from Cray Computer, including the Cray-3 and Cray-4, and in the Convex vector minisupercomputers of the C1, C2, C3 and C4/XA series.

But some vector supercomputers, for example the IBM ES/9000, work with vector operands located directly in RAM. Most likely, this approach is less promising in terms of performance, in particular because maintaining a high computation rate requires each vector instruction to fetch its operands from memory and write its results back sufficiently quickly.

Multiprocessor vector supercomputers (MIMD)

All of the mentioned vector supercomputers are produced in multiprocessor configurations, which already belong to the MIMD class.

Two key characteristics can be noted in the architecture of multiprocessor vector computers: the symmetry (equality) of all processors in the system and the sharing of a common field of RAM by all processors. Such computer systems are called tightly coupled. Whereas in single-processor vector computers a program must be vectorized to run efficiently, in multiprocessor computers there is the additional task of parallelizing the program so that it executes on several processors simultaneously.

Parallelization is perhaps the more complex problem, since it requires synchronization of parallel processes. Practice has shown, however, that a large number of algorithms can be parallelized effectively on the tightly coupled systems under consideration. The corresponding approach to parallelization on such computers is sometimes called the shared-memory model.
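
A minimal sketch of the shared-memory model is given below. It uses OpenMP, a modern directive-based interface that is not mentioned in the article and postdates the machines discussed here; it is shown only to illustrate the programming style: the threads share the arrays, and the reduction clause takes care of synchronizing the partial sums. Compile, for example, with cc -fopenmp dot.c.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Each thread works on a slice of the shared arrays; the partial
           sums are combined by the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %f\n", sum);
        return 0;
    }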

Multiprocessor SMP servers based on RISC architecture microprocessors

The performance of some modern RISC microprocessors has become comparable to that of the processors of vector computers. As a consequence, supercomputers of a new architecture exploiting these achievements have appeared: tightly coupled MIMD-class computers that are symmetric multiprocessor servers with a common field of RAM. It makes sense to pay more attention to these promising systems than to other architectures, since this range of issues has not been covered fully enough in the domestic computer literature.

The most famous supercomputer servers with a similar SMP architecture are DEC AlphaServer 8200/8400 and SGI POWER CHALLENGE. They are characterized by the use of a high-performance system bus, into the slots of which three types of modules are inserted - processor, RAM and I/O. Conventional, slower I/O buses, such as PCI or VME64, are already connected to the I/O modules. Obviously, such a design has a high degree of modularity and easily allows for configuration expansion, which is limited only by the available number of system bus slots and its performance.

Memory modules typically use DRAM technology, which makes it possible to achieve large memory capacities at relatively low cost. However, the rate of data exchange between processors and memory in such servers is many times lower than the throughput of the corresponding path in vector supercomputers, where RAM is built on more expensive SRAM technology. This is one of the main differences between the approaches to supercomputing used in multiprocessor vector computers and in SMP servers. The former usually have a relatively small number of vector registers, so, as already noted, to maintain high performance the data must be loaded into them, and the results written back to RAM, very quickly. Thus, high throughput on the processor-memory path is required.

In SMP servers the bandwidth of the memory modules is much lower, and the overall rate of data exchange with the processor modules is also limited by the (albeit high) bus bandwidth. In addition, the system bus may be occupied transferring data generated by the I/O modules. To give an idea of the orders of magnitude: the guaranteed throughput of the TurboLaser system bus in the AlphaServer 8200/8400 is 1.6 GB/s, that of the POWERpath-2 bus in POWER CHALLENGE is 1.2 GB/s, while the RAM bandwidth of the Cray T90 is 800 GB/s. Therefore, in SMP servers the developers try to reduce the very need for data exchange on the processor-memory path. Instead of the small amount of memory in vector registers (which is why they require fairly frequent reloading), the microprocessors in supercomputer SMP systems are equipped with a cache of very large size, for example 4 MB per microprocessor in the AlphaServer 8200/8400 and POWER CHALLENGE. As a result, for a very wide range of applications the goal is achieved.

Modern computers with SMP architecture, and clusters based on them, have characteristics in many respects comparable to those of large vector supercomputers, with the exception of RAM bandwidth. If we add the low operating costs of SMP systems, it becomes clear why the use of these much cheaper (compared to vector) supercomputers has spread so widely over the past two years.

The SMP systems analyzed here are not required to have a bus architecture. Instead of a bus, a switch can be used. A similar approach is used, for example, inside hypernodes of Convex Exemplar SPP computers. However, almost everything said in this section remains valid in this case.

Clusters

Clusters are the cheapest way to increase the performance of already installed computers. In essence, a cluster is a set of computers connected by some communication infrastructure. An ordinary computer network can serve as such a structure, but for performance reasons it is desirable to have high-speed connections (FDDI, ATM, HiPPI, etc.). Clusters can be built both from different computers (heterogeneous clusters) and from identical ones (homogeneous clusters). Obviously, all such systems belong to the MIMD class and are a classic example of loosely coupled systems. A separate article is devoted to various cluster systems.

The advantage of the cluster approach compared to SMP servers is improved scaling capabilities. Unlike SMP-architecture servers, where configuration expansion is limited by bus bandwidth, adding computers to a cluster allows you to increase the bandwidth of RAM and I/O subsystems.

In cluster systems, various message-passing models (PVM, MPI, etc.) are used to organize interaction between processes running on different computers when solving a single problem. However, in such systems, with memory distributed among the individual computers, the task of parallelization within these models is much more complex than in the shared-memory model used, for example, in SMP servers. To this must be added the purely hardware problems of message latency and data transfer speed. Therefore, the range of tasks that can be solved effectively on cluster systems is rather limited compared with symmetric, tightly coupled systems. For parallel processing of database queries such systems also have their own approaches.
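
The message-passing style can be illustrated by the minimal MPI sketch below (an illustration written for this text; MPI is named in the paragraph above, but the example itself is not from the article). Process 0 explicitly sends an array to process 1 over the interconnect; run it, for example, with mpirun -np 2 ./a.out.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double data[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* explicit send: the data is copied across the communication infrastructure */
            MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %.1f %.1f %.1f %.1f\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }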

Various supercomputers can be combined into clusters, for example the Cray J90 minisupercomputer, but the best-known clusters in the supercomputer world are the IBM SP2 and SGI POWER CHALLENGEarray. The possibility of having a large number of processor nodes in the SP2 allows this computer to be classified at the same time as an MPP system.

MPP systems (MIMD)

The main feature by which a system is classified as having an MPP architecture is the number of processors n. There is no strict boundary, but it is usually considered that a system with n >= 128 already qualifies as an MPP.

It is not at all necessary that the MPP system have distributed RAM, in which each processor node has its own local memory. For example, the SPP1000/XA and SPP1200/XA computers are an example of massively parallel systems, the memory of which is physically distributed between hypernodes, but is logically common to the entire computer. However, most MPP computers have both logically and physically distributed memory.

In any case, MPP systems belong to the MIMD class. If we talk about MPP computers with distributed memory and ignore the organization of I/O, then this architecture is a natural extension of the cluster architecture to a large number of nodes. Therefore, such systems are characterized by all the advantages and disadvantages of clusters. Moreover, due to the increased number of processor nodes, both the pros and cons become much more significant (a processor node is a computer unit that can contain several processors, for example, as in SNI/Pyramid RM1000 computers, and itself have an SMP architecture).

Thanks to scalability, it is MPP systems that are today the leaders in achieved computer performance; the most striking example of this is Intel Paragon. On the other hand, parallelization problems in MPP systems become even more difficult to resolve compared to clusters containing few processors. In addition, the performance gain usually decreases quite quickly as the number of processors increases. It is easy to increase the theoretical performance of a computer, but it is much more difficult to find tasks that could effectively load the processor nodes.

Today, not many applications can run efficiently on an MPP computer, and there is also the problem of porting programs between MPP systems with different architectures. The attempts of recent years to standardize message-passing models do not solve all the problems. The efficiency of parallelization often depends strongly on the details of the MPP system's architecture, for example on the topology of the processor-node interconnect.

The most efficient topology would be one in which any node could communicate directly with any other node. However, this is technically difficult to implement in MPP systems. Typically, the processor nodes in modern MPP computers form either a two-dimensional lattice (for example, in the SNI/Pyramid RM1000) or a hypercube (as in nCube supercomputers).

Since synchronizing processes that run in parallel on different nodes requires messages to travel from any node of the system to any other, an important characteristic is the diameter of the system d, the maximum distance between nodes. For a two-dimensional lattice d ~ sqrt(n); for a hypercube d ~ log(n). Thus, as the number of nodes grows, the hypercube architecture becomes more advantageous.
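
The difference is easy to quantify. The small program below (node counts chosen arbitrarily; the mesh diameter is taken as 2(sqrt(n) - 1) for a square lattice and log2(n) for a hypercube) compares the two topologies; compile with -lm.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int sizes[] = {64, 256, 1024, 4096};
        printf("%8s %12s %12s\n", "nodes", "2-D lattice", "hypercube");
        for (int i = 0; i < 4; i++) {
            int n = sizes[i];
            double lattice = 2.0 * (sqrt((double)n) - 1.0); /* diameter of a sqrt(n) x sqrt(n) mesh */
            double hcube = log2((double)n);                 /* diameter of a log2(n)-dimensional hypercube */
            printf("%8d %12.0f %12.0f\n", n, lattice, hcube);
        }
        return 0;
    }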

The time it takes to transfer information from node to node depends on the start delay and transmission speed. In any case, during the transmission time, processor nodes manage to execute many commands, and this ratio between the speed of processor nodes and the transmitting system is likely to be maintained - the progress in processor performance is much greater than in the throughput of communication channels. Therefore, the communication channel infrastructure is one of the main components of an Mpp computer.

Despite all the difficulties, the scope of application of MPP computers is gradually expanding. Various MPP systems are used in many of the world's leading supercomputer centers, as is clearly evident from the TOP500 list. In addition to those already mentioned, the Cray T3D and Cray T3E computers deserve special mention; they illustrate the fact that the world leader in the production of vector supercomputers, Cray Research, no longer focuses exclusively on vector systems. Finally, it is worth recalling that the newest supercomputer project of the US Department of Energy will be based on a Pentium Pro-based MPP system.

Supercomputer performance estimates

Because supercomputers have traditionally been used for computations on real numbers, most of today's performance estimates relate to such calculations. First among them is peak performance, measured in millions of floating-point operations that a computer can theoretically perform in one second (MFLOPS). Peak performance is a value that is practically unattainable. This is due, in particular, to the problem of keeping the pipelined functional units filled, which is typical not only of vector supercomputers but also of computers built on RISC microprocessors. It is especially important for superpipelined microprocessor architectures, such as the DEC Alpha, which are characterized by relatively long pipelines: the longer the pipeline, the more "start-up" time is needed to fill it. Such pipelines are effective when working with long vectors. For this reason the notion of half-performance length was introduced for evaluating vector supercomputers: the vector length at which half of the peak performance is achieved.
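This dependence is usually expressed by the model attributed to Hockney (see [4] in the literature): the sustained rate on vectors of length n is roughly r(n) = r_peak * n / (n + n_half), so that exactly half of the peak is reached at n = n_half. A short Python illustration with made-up parameters:

def vector_rate(n, r_peak_mflops, n_half):
    # Hockney's model: sustained MFLOPS on vectors of length n; r = r_peak / 2 when n equals n_half.
    return r_peak_mflops * n / (n + n_half)

# Made-up parameters: 400 MFLOPS peak and a half-performance length of 50 elements.
for n in (10, 50, 200, 1000):
    print(n, round(vector_rate(n, 400, 50)))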

More realistic performance estimates are based on the execution times of various tests. The best tests are, of course, real user applications. However, such estimates are, firstly, highly specific and, secondly, often inaccessible or simply unavailable. Therefore more universal tests are usually used, but the traditional methods of assessing microprocessor performance, the SPEC suites, are as a rule not used in the supercomputer world. This is due, in particular, to their low informativeness for supercomputer applications, especially in the case of SPEC 92, although the newer SPEC 95 standard gives a more realistic picture of performance. Today SPEC ratings are available only for supercomputers built on RISC microprocessors. A special new standard for high-performance computing, SPEChpc96, was recently announced.

Since loops usually take up most of a program's execution time, they are sometimes used as tests, for example the well-known Livermore loops. The most popular performance test today is Linpack, which solves a system of N linear equations by Gaussian elimination. Since the number of floating-point operations needed to solve the system is known, the measured time immediately gives the number of operations performed per second. There are several modifications of this test. Computer manufacturers typically publish results for N = 100. A standard Fortran program is freely available and must be run on the supercomputer to obtain the result; it may not be modified except to replace the subroutine calls that provide access to CPU time. Another standard test concerns the case N = 1000, which assumes the use of long vectors. These tests can be run on computers with different numbers of processors, which also yields estimates of the quality of parallelization.
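Since the operation count of Gaussian elimination is known (approximately 2/3*N^3 + 2*N^2 floating-point operations), the rating follows directly from the measured time. The sketch below merely restates that arithmetic in Python with arbitrary example numbers; it is not the official benchmark code.

def linpack_mflops(n, seconds):
    # MFLOPS rating from the standard operation count for solving an n x n dense system.
    operations = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return operations / seconds / 1e6

# Example with arbitrary numbers: a 1000 x 1000 system solved in 1.5 seconds.
print(round(linpack_mflops(1000, 1.5)))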

For MPP systems the Linpack-parallel test is more interesting; in it, performance is measured for large N and large numbers of processors. The leader here is the 6768-processor Intel Paragon (281 GFLOPS at N = 128600). As for single-processor performance, at N = 100 the leader is the Cray T916 (522 MFLOPS); at N = 1000, and in peak performance, it is the Hitachi S3800 (6431 and 8000 MFLOPS, respectively). For comparison, a processor of the AlphaServer 8400 achieves 140 MFLOPS at N = 100 and 411 MFLOPS at N = 1000.

For highly parallel supercomputers, the NAS Parallel Benchmarks have recently been used more and more often; they are especially well suited to problems of computational gas and fluid dynamics. Their disadvantage is that they fix the solution algorithm rather than the program text. Additional information about the various tests can be found in the literature cited below.

***

In today's supercomputer world there is a new wave, driven both by advances in microprocessor technology and by the emergence of a new range of tasks that go beyond traditional research laboratories. The performance of RISC microprocessors is progressing rapidly, growing noticeably faster than that of vector processors. For example, the HP PA-8000 microprocessor lags the Cray T90 by only about a factor of two. As a result, vector supercomputers are likely to be further displaced in the near future by computers built on RISC microprocessors, such as the IBM SP2, Convex/HP SPP, DEC AlphaServer 8400 and SGI POWER CHALLENGE. This was confirmed by the results of the TOP500 rating, where the leaders in the number of installations were the POWER CHALLENGE and SP2 systems, ahead of the models of the leading supercomputer manufacturer, Cray Research.

However, it is obvious that the development of vector supercomputers will continue, at least at Cray Research. It may be beginning to be held back by requirements of compatibility with older models. Thus the Cray-4 system from Cray Computer, which had configuration characteristics and performance close to the latest Cray T90 system from Cray Research at half the price but was incompatible with Cray Research computers, found no customers; as a result, Cray Computer went bankrupt.

Systems based on MPP architectures, including those with distributed memory, are developing successfully. The emergence of new high-performance microprocessors built with inexpensive CMOS technology significantly increases the competitiveness of these systems.

As for new solutions based on VLIW architectures, we can confidently assume that, at least for the next two years, RISC processors have nothing to fear from them.

This review does not pretend to be complete, but is an attempt to present a general picture of the state of affairs in the field of supercomputers. For a more detailed look at the architectures of specific systems, you can refer to other publications in this issue of the journal.

Literature

1. ComputerWorld Russia, # 9, 1995.

2. K. Wilson, in: "High-Speed Computing". M.: Radio and Communication, 1988, pp. 12-48.

3. B.A. Golovkin, "Parallel Computing Systems". M.: Nauka, 1980, 519 p.

4. R. Hockney, C. Jesshope, "Parallel Computers". M.: Radio and Communication, 1986, 390 p.

5. Flynn M.J., IEEE Trans. Comput., 1972, v. C-21, # 9, pp. 948-960.

6. Russell R.M., Commun. ACM, 1978, v. 21, # 1, pp. 63-72.

7. T. Motooka, S. Tomita, H. Tanaka, T. Saito, T. Uehara, "VLSI Computers". M.: Mir, 1988, 388 p.

8. M. Kuzminsky, The PA-8000 processor. Open Systems, # 5, 1995.

9. Open Systems Today, # 11, 1995.

10. ComputerWorld Russia, ## 4, 6, 1995.

11. ComputerWorld Russia, # 8, 1995.

12. Open Systems Today, # 9, 1995.

13. ComputerWorld Russia, # 2, 1995.

14. ComputerWorld Russia, # 12, 1995.

15. V. Shnitman, Exemplar SPP1200 systems.

16. Mikhail Borisov, UNIX clusters. Open Systems, # 2, 1995, pp. 22-28.

17. V. Shnitman, IBM SP2 systems. Open Systems, # 6, 1995.

18. Natalya Dubova, nCube supercomputers. Open Systems, # 2, 1995, pp. 42-47.

19. Dmitry Frantsuzov, A test for assessing supercomputer performance. Open Systems, # 6, 1995.

20. Dmitry Volkov, Open Systems, # 2, 1994, pp. 44-48.

21. Andrey Volkov, TPC tests. DBMS, # 2, 1995, pp. 70-78.

Mikhail Kuzminsky, IOC RAS; Dmitry Volkov, IPM RAS (Moscow).



Supercomputers: past, present and future


If the ILLIAC IV developers had managed to overcome the problems of programming a matrix of processor elements, the development of computing technology would probably have taken a completely different path, and today computers with a matrix architecture would dominate. However, neither in the 60s nor later was a satisfactory and universal solution found to two fundamental problems: programming the parallel operation of several hundred processors and, at the same time, keeping to a minimum the time spent on data exchange between them. It took roughly another fifteen years of efforts by various companies to build supercomputers with a matrix architecture before the final diagnosis could be made: computers of this type cannot satisfy a wide range of users and have a very limited field of application, often confined to one or a few classes of problems.

As the means of ultra-high-speed data processing were mastered, a gap became apparent between progress in program vectorization, i.e. the automatic conversion at compile time of sequential language constructs into vector form, and the extreme complexity of programming the switching and distribution of data between processor elements. One way of raising the performance of a vector processor is to increase the number of pipelines that fetch operands and record results, but the gain is less than proportional. For example, if a pipeline that performs one elementary operation in five clock cycles is replaced with four of the same pipelines, then with a vector length of 100 elements the vector instruction is accelerated by a factor of only 3.69, not 4. The lag of performance growth behind the increase in the number of pipelines is especially noticeable when the processor spends a significant amount of time exchanging data between the pipelines and memory. This issue was not properly appreciated during the development of the CYBER-205, and as a result the memory-to-memory architecture of this model so degraded the dynamic parameters of the four pipelines of its vector processor that a very high degree of program vectorization (vectors of about a thousand elements) was required to achieve performance close to 200 MFLOPS; in other words, potentially the most powerful supercomputer of the 70s could in practice process only a limited class of problems effectively. Such a miscalculation naturally had a negative impact on the market fate of the CYBER-205 and on Control Data's entire supercomputer program. After the CYBER-205, CDC stopped trying to develop the supercomputer market.
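Under a simple timing model of our own (a pipeline of depth L delivers one result per cycle once filled, and the vector is split evenly among k pipelines), the speedup is (L + N - 1) / (L + N/k - 1); for L = 5, N = 100, k = 4 this gives about 3.6, close to the less-than-fourfold figure quoted above. The exact value depends on the timing assumptions.

def pipelined_cycles(n_elements, depth, pipes=1):
    # Cycles to push n_elements through 'pipes' identical pipelines of the given depth,
    # assuming one result per pipeline per cycle after the pipeline has filled.
    per_pipe = -(-n_elements // pipes)   # ceiling division
    return depth + per_pipe - 1

one_pipe = pipelined_cycles(100, 5)             # 104 cycles
four_pipes = pipelined_cycles(100, 5, pipes=4)  # 29 cycles
print(one_pipe, four_pipes, round(one_pipe / four_pipes, 2))   # speedup of about 3.6 rather than 4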

The use of a register-register architecture in NEC SX supercomputers made it possible to neutralize the disadvantages of multi-pipeline processing, and the NEC SX-2 model, with 16 vector pipelines, became the first supercomputer to pass the milestone of a billion floating-point operations per second: its peak performance was 1.3 GFLOPS. Hitachi took a different path. In the S-810 series supercomputers, the emphasis was placed on the parallel execution of six vector instructions at once. Hitachi then continued this family with the S-810/60 and S-810/80 models; the latter took a respectable third place in performance testing on the LINPACK package, second only to the giants from CRAY and NEC. The relative commercial stability of Hitachi supercomputers can be explained by the fact that they, like Fujitsu supercomputers, are fully compatible with the IBM/370 system for scalar operations. This makes it possible to use programs written in IBM VS FORTRAN and ANSI X3.9 (FORTRAN 77), as well as the standard MVS TSO/SPF operating environment and most IBM system extensions, including I/O control for IBM-compatible disk and tape drives. In other words, the Japanese supercomputers from Hitachi and Fujitsu were the first in the supercomputer world to offer a familiar environment to users of the most widespread computing system of the time, the IBM/370.

The onslaught of the Japanese manufacturers was impressive, but S. Cray delivered a timely counterattack: in 1982 the first model of the CRAY X-MP family of supercomputers appeared on the market, and two years later the first CRAY-2 supercomputer was installed at the Lawrence Livermore National Laboratory. The machines from Cray Research were ahead of their competitors in the main thing: they marked the birth of a new generation of ultra-high-performance computers in which vector-pipeline parallelism was complemented by multiprocessing. Cray used extraordinary solutions to the problem of increasing performance in his computers. Having retained the architectural and structural groundwork of the CRAY-1 in the CRAY-2 and CRAY X-MP, he pressed his competitors on two fronts at once: he achieved a record low machine cycle time (4.1 ns) and expanded system parallelism through multiprocessing. As a result, Cray Research retained its title of absolute performance champion: the CRAY-2 demonstrated a peak performance of 2 GFLOPS, outperforming the NEC SX-2, the fastest Japanese supercomputer, by a factor of one and a half. To minimize the machine cycle, Cray went further than the Japanese, who already possessed ECL-LSI technology, which had allowed the Fujitsu VP to achieve a machine cycle of 7.5 ns. In addition to using high-speed ECL circuits, the design of the CRAY-2 CPU blocks provided maximally dense packing of components. To cool such a unique system, which dissipated no less than 195 kW, the modules were immersed in a fluorocarbon coolant, a special liquid refrigerant produced by the American company 3M.

The second revolutionary solution implemented in the CRAY-2 supercomputer was increasing the amount of RAM to 2 GB. S. Cray managed to satisfy Flynn's criterion for balancing performance and RAM capacity: for every million operations per second of processor performance there must be at least 1 MB of RAM. The essence of the problem is that typical problems of hydro- and aerodynamics, nuclear physics, geology, meteorology and other disciplines solved on supercomputers require processing a significant amount of data to obtain results of acceptable accuracy. With such volumes of computation, a relatively small RAM naturally causes intensive exchange with disk memory, which, in full accordance with Amdahl's law, leads to a sharp drop in system performance.
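Applying the balance rule quoted above to the machines mentioned in the text is simple arithmetic; the check below merely restates the criterion in Python and is our own illustration, not taken from the article.

def balanced_memory_mb(peak_mflops):
    # The balance rule quoted above: at least 1 MB of RAM for each MFLOPS of performance.
    return peak_mflops

# CRAY-2: roughly 2000 MFLOPS of peak performance implies at least about 2000 MB, i.e. 2 GB of RAM.
print(balanced_memory_mb(2000))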

Still, the new qualitative level of the CRAY-2 supercomputer was determined not so much by the ultra-short machine cycle and the very large RAM as by the multiprocessor architecture borrowed from another Cray Research development, the CRAY X-MP family of multiprocessor supercomputers. Its three base models, the X-MP/1, X-MP/2 and X-MP/4, offered users single-, dual- or quad-processor configurations with 410 MFLOPS per processor. The range of available options was broadened by the possibility of installing memory of different sizes (from 32 to 128 MB per system). This market-oriented approach to building a supercomputer subsequently brought Cray Research tangible commercial benefits. The multiprocessor architecture of CRAY supercomputers was developed with the achievements and shortcomings of multiprocessor mainframes, primarily from IBM, taken into account. Unlike the "classical" IBM operating systems, which use global variables and semaphores in shared memory for process interaction, the CRAY multiprocessor architecture provides data exchange between processors through special cluster registers; in addition, to service the interaction of processes, the CRAY architecture provides hardware-implemented semaphore flags that are set, reset and tested by special instructions, which further speeds up interprocessor communication and ultimately increases system performance. As a result of these innovations, the speedup of the dual-processor CRAY X-MP/2 relative to the single-processor CRAY X-MP/1 is no less than 1.86.
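A speedup of 1.86 on two processors corresponds to a parallel efficiency of about 93%; inverting Amdahl's law gives the parallel fraction this implies. The calculation below is our own, not from the original article.

def parallel_fraction(speedup, n):
    # Invert Amdahl's law: the parallel fraction implied by a measured speedup on n processors.
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)

speedup, n = 1.86, 2
print(round(speedup / n, 3), round(parallel_fraction(speedup, n), 3))   # efficiency ~0.93, parallel fraction ~0.92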

Unlike the CRAY X-MP family, whose models run the COS operating system (Cray Operating System), the CRAY-2 was equipped with a new operating system, CX-COS, created by Cray Research on the basis of Unix System V.

In the second half of the 80s, Control Data, which had "gone out of business" in supercomputers after the failure of the CYBER-205, reappeared on the supercomputer market. Strictly speaking, the development of a new eight-processor supercomputer was undertaken by ETA Systems, a subsidiary of CDC, but almost the entire potential of Control Data was put into this project. Initially the project, called ETA-10, received government support through contracts and grants to potential users and caused a stir among specialists in ultra-high-speed processing: the new supercomputer was supposed to reach a performance of 10 GFLOPS, i.e. five times the computation speed of the CRAY-2. The first ETA-10 sample with a single processor delivering 750 MFLOPS was demonstrated in 1988, but then things went downhill. In the second quarter of 1989, Control Data announced the winding down of ETA Systems because production was unprofitable.

The giant of the computer world, IBM, did not remain aloof from the problems of ultra-high performance. Not wanting to lose its users to Cray Research, the company launched a program to produce the older models of the IBM 3090 family with vector processing facilities (Vector Facility). The most powerful model of this series, the IBM 3090/VF-600S, is equipped with six vector processors and 512 MB of RAM. This line was subsequently continued by ESA-architecture machines such as the IBM ES/9000-700 VF and ES/9000-900 VF, whose performance in the maximum configuration reached 450 MFLOPS.

Another well-known company in the computer world, Digital Equipment Corp., announced a new series of mainframes with vector processing capabilities in October 1989. The top model, the VAX 9000/440, is equipped with four vector processors that raise the computer's performance to 500 MFLOPS.

The high cost of supercomputers and vector mainframes turned out to be beyond the means of a fairly wide range of customers potentially ready to take advantage of parallel computing: small and medium-sized research centers and universities, as well as manufacturing companies that need high-performance but relatively inexpensive computing technology.

On the other hand, major supercomputer manufacturers such as Cray Research, Fujitsu, Hitachi and NEC clearly underestimated the needs of "average" users, focusing on record performance figures and, unfortunately, equally record prices for their products. Control Data's strategy turned out to be much more flexible: after the failure of the CYBER-205 it concentrated on producing middle-class scientific computers. At the end of 1988, the output of machines of the CYBER-932 type was twice that of the older models of the CYBER-900 series and of the CDC-branded supercomputers. Control Data's main competitor in the market of compact parallel computers, collectively called "mini-supercomputers", was the future leader of that market, Convex Computer. In its designs Convex was the first to implement a vector architecture in very large-scale integrated circuits (VLSI) using CMOS technology. As a result, users received a series of relatively inexpensive computers priced under $1 million with performance ranging from 20 to 80 MFLOPS. Demand for these machines exceeded all expectations, and the seemingly risky investment in the Convex program turned into a quick and solid return. The history of supercomputer development clearly shows that in this most complex field, investment in high technology as a rule pays off, provided that the project is addressed to a fairly wide range of users and does not contain overly risky technical solutions. Having gained such an advantage at the start, Convex continued to develop successfully. First it launched the Convex C-3200 family, whose senior model, the C-3240, delivers 200 MFLOPS, and then the Convex C-3800 family, consisting of four basic models in one-, two-, four- and eight-processor configurations. The most powerful machine of this series, the Convex C-3880, has performance worthy of a "real" supercomputer of the 80s, and in LINPACK testing it outperformed systems such as the IBM ES/9000-900 VF, the ETA-10P and even the CRAY-1S in computation speed. Note that Cray Research also produces a mini-supercomputer, the CRAY Y-EL, likewise implemented in CMOS VLSI. This computer comes in single-, dual- or quad-processor configurations and delivers 133 MFLOPS per processor. The amount of RAM varies with the customer's wishes in the range of 256-1024 MB.

The dominance of vector supercomputers in government programs and the stable position of “king of the hill” occupied by Cray Research clearly did not suit supporters of MIMD parallelism. Initially, multiprocessor mainframes were included in this class, and subsequently third-generation supercomputers with a multiprocessor structure were added to them. Both are based on the principle formulated by von Neumann of controlling the computational process according to program commands, or controlling the flow of commands (Instruction Flow). However, from about the mid-60s, mathematicians began to discuss the problem of dividing a problem into a large number of parallel processes, each of which can be processed independently of the others, and the execution of the entire task is controlled by transferring data from one process to another. This principle, known as Data Flow, looks very promising in theory. DataFlow parallelism theorists assumed that the system could be organized from small and therefore cheap processors of the same type. Achieving ultra-high performance was entirely the responsibility of the compiler, which parallelizes the computing process, and the OS, which coordinates the functioning of the processors. The apparent simplicity of the MIMD parallelism principle has given rise to many projects.

Among the best-known developments of MIMD-class systems, it is worth mentioning the IBM RP3 (512 processors, 800 MFLOPS), Cedar (256 processors, 3.2 GFLOPS, a development of the same University of Illinois), nCUBE/10 (1024 processors, 500 MFLOPS) and FPS-T (4096 processors, 65 GFLOPS). Unfortunately, none of these projects was a complete success, and none of the systems mentioned showed the advertised performance. The point is that, as with matrix SIMD supercomputers, too many technical and software problems were associated with organizing a switch to provide data exchange between processors. In addition, the processors that make up a MIMD system turned out in practice to be not so small and cheap. As a result, increasing their number led to such growth in the size of the system and lengthening of interprocessor connections that it became quite obvious: with the component technology of the late 80s, the MIMD architecture could not produce systems capable of competing with vector supercomputers.

An extraordinary solution to the problem of the switching network for the processors of a MIMD system was proposed by the little-known company Denelcor, which developed the HEP-1 multiprocessor. This supercomputer was conceived as a MIMD system containing from 1 to 16 execution processing elements and up to 128 data memory banks of 8 MB each. A 16-processor system was supposed to have a maximum performance of 160 MFLOPS with parallel processing of 1024 processes (64 processes in each of the 16 PEs). A curious architectural feature of the HEP-1 was that MIMD processing of many processes was performed without a switching network, which was replaced by the so-called "Flynn pinwheel".

Recall that the idea of the "Flynn pinwheel" is to organize a multiprocessor as a system consisting of a group of command processors (CPs), each of which "conducts" its own instruction stream, and a set of arithmetic units common to all CPs, connected to each CP in turn, in a cycle, to execute its instructions. It is easy to see that the effect of the "Flynn pinwheel" is to reduce the share of a multiprocessor system occupied by arithmetic units, since "arithmetic" can account for up to 60% of the hardware resources of a central processor.
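The cyclic connection described here amounts to round-robin sharing of the arithmetic units among independent instruction streams. The toy Python simulation below is entirely our own construction, not Denelcor code: it interleaves several streams through one shared "arithmetic unit", one operation per turn.

from collections import deque

def flynn_pinwheel(streams):
    # Round-robin one instruction at a time from each independent stream through a shared arithmetic unit.
    ring = deque(deque(s) for s in streams)
    executed = []
    while ring:
        stream = ring.popleft()
        executed.append(stream.popleft())   # the shared unit executes one instruction of this stream
        if stream:
            ring.append(stream)             # the stream rejoins the ring for its next turn
    return executed

# Three independent instruction streams sharing one arithmetic unit.
print(flynn_pinwheel([["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]))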

At first glance, the structure of the HEP-1 differs little from the classical "Flynn pinwheel": the same cyclic launching of instructions belonging to different processes, and the same arithmetic units shared by many processes. However, at the input of the execution units it is not command processors that are switched but processes, using a special mechanism for fetching, saving and restoring the status word of each executing process. In addition, the HEP-1 uses pipelined execution units, which allows the arithmetic units to process significantly more operations than in its mainframe prototypes. It would seem that a solution had finally been found that combined the advantages of MIMD multiprocessing with the economy of shared pipelined arithmetic units.
