Version 8 of the Intel compilers. Using the Intel and GCC compilers together


In the previous issue of the magazine we discussed the Intel VTune Performance Analyzer family of performance analysis tools, which are deservedly popular among application developers. They detect the instructions in application code that waste too many processor resources, giving developers the opportunity to identify and eliminate the potential bottlenecks associated with such sections of code and thereby speed up application development. Note, however, that application performance also depends to a large extent on how efficient the compilers used to build the application are, and on which hardware features they exploit when generating machine code.

The latest versions of the Intel C++ and Intel Fortran compilers for the Windows and Linux operating systems can deliver up to a 40% gain in application performance on systems based on Intel Itanium 2, Intel Xeon and Intel Pentium 4 processors, compared with existing compilers from other vendors, by exploiting features of these processors such as Hyper-Threading technology.

The code optimizations characteristic of this family of compilers include use of the stack for floating-point operations, interprocedural optimization (IPO), profile-guided optimization (PGO), data prefetching into the cache (which avoids the latency associated with memory accesses), support for specific features of Intel processors (for example, the Intel Streaming SIMD Extensions 2 characteristic of the Intel Pentium 4), automatic parallelization of code execution, creation of applications that run on several different types of processors while being optimized for one of them, branch prediction support, and extended support for working with execution threads.

Note that Intel compilers are used by such well-known companies as Alias/Wavefront, Oracle, Fujitsu Siemens, ABAQUS, Silicon Graphics and IBM. According to independent tests carried out by a number of companies, the performance of the Intel compilers is noticeably higher than that of compilers from other vendors (see, for example, http://intel.com/software/products/compilers/techtopics/compiler_gnu_perf.pdf).

Below we will look at some of the features of the latest versions of Intel compilers for desktop and server operating systems.

Compilers for the Microsoft Windows platform

Intel C++ Compiler 7.1 for Windows

Intel C++ Compiler 7.1, released earlier this year, produces highly optimized code for the Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Xeon processors, as well as for the Intel Pentium M processor used with Intel Centrino technology in mobile PCs.

This compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools: it can be integrated into the corresponding development environments.

This compiler supports ANSI and ISO C/C++ standards.

Intel Fortran Compiler 7.1 for Windows

Intel Fortran Compiler 7.1 for Windows, also released earlier this year, allows you to create optimized code for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Xeon, Intel Pentium M processors.

This compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools, that is, it can be integrated into the corresponding development environments. In addition, the 64-bit Intel Fortran Compiler makes it possible to develop, from Microsoft Visual Studio on a 32-bit Pentium system, 64-bit applications for operating systems running on Itanium/Itanium 2 processors. When debugging code, this compiler lets you use a debugger for the Microsoft .NET platform.

If you have Compaq Visual Fortran 6.6 installed, you can use Intel Fortran Compiler 7.1 in its place, since the two compilers are compatible at the source code level.

Intel Fortran Compiler 7.1 for Windows is fully compatible with the ISO Fortran 95 standard and supports the creation and debugging of applications containing code in two languages: C and Fortran.

Compilers for the Linux platform

Intel C++ Compiler 7.1 for Linux

Another compiler released at the beginning of the year, Intel C++ Compiler 7.1 for Linux, achieves a high degree of code optimization for the Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. It is fully compatible with the GNU C compiler at the level of source code and object modules, which allows applications created with GNU C to be migrated to it at no additional cost. The Intel C++ Compiler supports the C++ ABI, which means full compatibility with the gcc 3.2 compiler at the binary code level. Finally, with Intel C++ Compiler 7.1 for Linux you can even recompile the Linux kernel after making a few minor changes to its source code.

Intel Fortran Compiler 7.1 for Linux

Intel Fortran Compiler 7.1 for Linux creates optimized code for the Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the Compaq Visual Fortran 6.6 compiler at the source code level, so applications created with Compaq Visual Fortran can be recompiled with it, thereby increasing their performance.

In addition, this compiler works with the utilities developers commonly use, such as the emacs editor, the gdb debugger and the make build utility.

Like the Windows version of this compiler, Intel Fortran Compiler 7.1 for Linux is fully compatible with the ISO Fortran 95 standard and supports the creation and debugging of applications containing code in two languages: C and Fortran.

It should be especially emphasized that a significant contribution to the creation of the listed Intel compilers was made by specialists from the Intel Russian Software Development Center in Nizhny Novgorod. More information about Intel compilers can be found on the Intel Web site at www.intel.com/software/products/.

The second part of this article will be devoted to Intel compilers that create applications for mobile devices.

Introduction

In late 2003, Intel introduced version 8.0 of its compiler collection. The new compilers are designed to improve the performance of applications running on servers, desktops and mobile systems (laptops, mobile phones and PDAs) based on Intel processors. It is worth noting that this product was created with the active participation of employees of the Nizhny Novgorod Intel Software Development Center and Intel specialists from Sarov.

The new series includes Intel C++ and Fortran compilers for Windows and Linux, as well as Intel C++ compilers for Windows CE .NET. The compilers are targeted at systems based on the following Intel processors: Intel Itanium 2, Intel Xeon, Intel Pentium 4, Intel Personal Internet Client Architecture processors for mobile phones and Pocket PCs, and the Intel Pentium M processor for mobile PCs (a component of Intel Centrino technology).

The Intel Visual Fortran Compiler for Windows provides next-generation compilation technology for high-performance computing solutions. It combines the language functionality of Compaq Visual Fortran (CVF) with the performance improvements made possible by Intel's compilation and code-generation optimization technologies, and it simplifies porting source code developed with CVF to the Intel Visual Fortran environment. This compiler is the first to implement CVF language features both for 32-bit Intel systems and for systems based on Intel Itanium family processors running Windows. In addition, it makes CVF language features available on Linux systems based on 32-bit Intel processors and Intel Itanium family processors. In 2004 an expanded version of this compiler is planned, the Intel Visual Fortran Compiler Professional Edition for Windows, which will include the IMSL Fortran 5.0 Library developed by Visual Numerics, Inc.


"The new compilers also support future Intel processors, codenamed Prescott, which feature new graphics and video performance commands and other performance enhancements. They also support the new Mobile MMX(tm) technology, which similarly improves graphics performance. , audio and video applications for mobile phones and pocket PCs," said Alexey Odinokov, co-director of the Intel Center for Software Development in Nizhny Novgorod. "These compilers provide application developers with a single set of tools for building new applications for wireless networks based on Intel architecture. New Intel compilers also support Intel's Hyper-Threading technology and the OpenMP 2.0 industry specification, which defines the use of high-level directives to control instruction flow in applications."

Among the new tools shipped with the compilers are Intel Code Coverage and Intel Test Prioritization. Together these tools make it possible to speed up application development and improve application quality by improving the software testing process.

The Code Coverage tool reports, during application testing, which parts of the application logic are exercised and where the exercised areas are located in the source code. If changes are made to the application, or if a test does not cover the part of the application the developer is interested in, the Test Prioritization tool makes it possible to check the operation of the selected section of program code.

New Intel compilers are available in different configurations, costing from $399 to $1,499. They can be purchased today from Intel or from resellers around the world, a list of which is located on the website http://www.intel.com/software/products/reseller.htm#Russia.

Prescott processor support

Support for the Intel Pentium 4 (Prescott) processor in the eighth version of the compiler is as follows:

1. Support for SSE3 instructions (also known as PNI, Prescott New Instructions). Three levels of support can be distinguished here:

a. Inline assembly. For example, the compiler recognizes the SSE3 instruction in __asm { addsubpd xmm0, xmm1 }. In this way users interested in low-level optimization get direct access to the assembly instructions.
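As a minimal sketch of what such an assembly insert might look like in practice (the data layout, variable names and the use of unaligned loads are illustrative assumptions; a 32-bit build under the compiler's MASM-style inline assembler is assumed), addsubpd subtracts in the low element and adds in the high element:

    #include <stdio.h>

    int main(void)
    {
        double a[2] = { 1.0, 2.0 };
        double b[2] = { 0.5, 0.5 };

        __asm {
            lea      eax, a           ; address of a
            lea      edx, b           ; address of b
            movupd   xmm0, [eax]      ; xmm0 = { 1.0, 2.0 }
            movupd   xmm1, [edx]      ; xmm1 = { 0.5, 0.5 }
            addsubpd xmm0, xmm1       ; xmm0 = { 1.0 - 0.5, 2.0 + 0.5 }
            movupd   [eax], xmm0      ; store the result back into a
        }

        printf("%f %f\n", a[0], a[1]);    /* expected output: 0.500000 2.500000 */
        return 0;
    }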

b. In the C/C++ compiler the new instructions are also available at a higher level than assembly inserts, namely through built-in (intrinsic) functions:

Built-in functions and generated instructions

    Built-in function      Generated instruction
    _mm_addsub_ps          addsubps
    _mm_hadd_ps            haddps
    _mm_hsub_ps            hsubps
    _mm_moveldup_ps        movsldup
    _mm_movehdup_ps        movshdup
    _mm_addsub_pd          addsubpd
    _mm_hadd_pd            haddpd
    _mm_hsub_pd            hsubpd
    _mm_loaddup_pd         movddup xmm, m64
    _mm_movedup_pd         movddup reg, reg
    _mm_lddqu_si128        lddqu

The table shows the built-in functions and the corresponding assembly instructions from the SSE3 set. The same kind of support exists for instructions from the MMX/SSE/SSE2 sets. This allows the programmer to perform low-level code optimization without resorting to assembly language: the compiler itself takes care of mapping the built-in functions to the corresponding processor instructions and of using registers optimally, so the programmer can concentrate on creating an algorithm that makes effective use of the new instruction sets.
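For instance, here is a hedged sketch of how the intrinsics from the table might be used to multiply two complex numbers stored as {real, imaginary} pairs of doubles (the function name and the assumption of 16-byte-aligned data are illustrative; <pmmintrin.h> is the header that declares the SSE3 intrinsics):

    #include <pmmintrin.h>

    /* dst = a * b, where a, b and dst each hold one complex number: { re, im } */
    void complex_mul(double *dst, const double *a, const double *b)
    {
        __m128d x  = _mm_load_pd(a);            /* x  = { a.re, a.im }            */
        __m128d yr = _mm_loaddup_pd(&b[0]);     /* yr = { b.re, b.re }  (movddup) */
        __m128d yi = _mm_loaddup_pd(&b[1]);     /* yi = { b.im, b.im }  (movddup) */
        __m128d t1 = _mm_mul_pd(x, yr);         /* { a.re*b.re, a.im*b.re }       */
        __m128d xs = _mm_shuffle_pd(x, x, 1);   /* { a.im, a.re }                 */
        __m128d t2 = _mm_mul_pd(xs, yi);        /* { a.im*b.im, a.re*b.im }       */

        /* addsubpd subtracts in the low element and adds in the high element */
        _mm_store_pd(dst, _mm_addsub_pd(t1, t2));
        /* dst = { a.re*b.re - a.im*b.im, a.im*b.re + a.re*b.im } */
    }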

c. Automatic generation of the new instructions by the compiler. The previous two methods require the programmer to use the new instructions explicitly. But the compiler is also able (with the appropriate options; see section 1.1 below) to generate instructions from the SSE3 set automatically for program code in C/C++ and Fortran. One example is the optimized unaligned load instruction (lddqu), whose use can bring a performance gain of up to 40% (for example, in video and audio encoding tasks). Other instructions in the SSE3 set give a significant speedup in 3D graphics tasks or in computations involving complex numbers. For example, the graph in section 1.1 below shows that for the 168.wupwise application from the SPEC CPU2000 FP suite, the speedup obtained from automatic SSE3 instruction generation was about 25%; the performance of this application depends significantly on the speed of complex-number arithmetic.
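A hedged illustration of this third level: the following plain C99 loop over complex numbers contains no intrinsics or assembly, yet when built with the Prescott-specific option (for example -QxP, discussed below) the compiler may vectorize it automatically using the SSE3 instructions mentioned above (the function name is illustrative):

    #include <complex.h>
    #include <stddef.h>

    /* element-wise complex multiplication: dst[i] = a[i] * b[i] */
    void cmul_arrays(double complex *restrict dst,
                     const double complex *restrict a,
                     const double complex *restrict b,
                     size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = a[i] * b[i];   /* a candidate for addsubpd/movddup code */
    }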

2. Use of the microarchitectural advantages of the Prescott processor. When generating code, the compiler takes into account the microarchitectural changes in the new processor. For example, some operations (such as integer shifts, integer multiplication, or conversion of numbers between the different floating-point formats in SSE2) are faster on the new processor than on its predecessors (an integer shift now takes one processor cycle versus four on the previous Intel Pentium 4 core). More intensive use of such instructions can speed up applications considerably.
Another example of a microarchitectural change is the improved store-forwarding mechanism (fast loading of data recently stored to memory): the store actually goes first not to cache memory but to an intermediate store buffer, which then allows very fast access to the data. This feature of the architecture makes it possible, for example, to perform more aggressive automatic vectorization of program code.
The compiler also takes into account the increased size of the first- and second-level caches.

3. Improved support for Hyper-Threading technology. This point is closely related to the previous one, microarchitectural changes and their use in the compiler. For example, the runtime library that implements support for the OpenMP industry specification has been optimized to run on the new processor.

Performance

Using the compilers is a simple and effective way to take advantage of the Intel processor architectures. Below, two (admittedly rough) ways of using the compilers are distinguished: a) recompiling programs, possibly with changed compiler settings; b) recompiling with changes both to the compiler settings and to the source code, together with the use of the compiler's diagnostics about the optimizations it performs and, possibly, other software tools (for example, profilers).


1.1 Optimizing programs using recompilation and changing compiler settings


Often the first step in migrating to a new optimizing compiler is to use it with its default settings. The next logical step is to use options for more aggressive optimization. Figures 1, 2, 3 and 4 show the effect of switching to the Intel compiler version 8.0 compared to using other industry-leading products (-O2 - default compiler settings, base - settings for maximum performance). The comparison is made on 32- and 64-bit Intel architectures. Applications from SPEC CPU2000 are used as a test set.


Figure 1




Figure 2




Figure 3




Figure 4


Some of the options supported by the Intel compiler are listed below. The options given are for the Windows family of operating systems; for the Linux family there are options with the same effect whose names may differ (for example, -Od or -QxK on Windows corresponds to -O0 or -xK on Linux, respectively); more information can be found in the compiler manual.


Controlling the optimization level: the options -Od (no optimization; used for debugging programs), -O1 (maximum speed while minimizing code size), -O2 (optimization for execution speed; applied by default), -O3 (the most aggressive optimizations for execution speed; in some cases they can lead to the opposite effect, i.e. a slowdown; note that on IA-64 the use of -O3 leads to a speedup in most cases, while the positive effect on IA-32 is less pronounced). Examples of optimizations enabled by -O3: loop interchange, loop fusion, loop distribution (the inverse of loop fusion), software prefetching of data. The reason -O3 can sometimes cause a slowdown is that the compiler chooses aggressive optimizations heuristically for a particular case without having sufficient information about the program (for example, it may generate prefetch instructions for the data used in a loop on the assumption that the loop executes many times, when in fact it has only a few iterations). Interprocedural optimization, profile-guided optimization, and the various programmer "hints" (see section 1.2) can help in this situation.
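To make one of these transformations concrete, here is a hedged sketch of loop interchange (the function names are illustrative): the first version walks the array column by column with a large stride, while the interchanged order walks memory contiguously; at -O3 the compiler may perform this reordering itself.

    #define N 1024

    void scale_by_columns(double a[N][N], double s)
    {
        for (int j = 0; j < N; ++j)          /* stride-N accesses: poor cache behaviour */
            for (int i = 0; i < N; ++i)
                a[i][j] *= s;
    }

    void scale_by_rows(double a[N][N], double s)
    {
        for (int i = 0; i < N; ++i)          /* unit-stride accesses after interchange  */
            for (int j = 0; j < N; ++j)
                a[i][j] *= s;
    }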

Interprocedural optimization: -Qip (within one file) and -Qipo (across several or all project files). This includes optimizations such as inline substitution of frequently called code (reducing the cost of a function/procedure call). It also supplies information to other optimization stages - for example, information about a loop's upper bound (say, when it is a compile-time constant defined in one file but used in many), or information about the alignment of data in memory (many MMX/SSE/SSE2/SSE3 instructions work faster when their operands are aligned in memory on an 8- or 16-byte boundary). Analysis of memory allocation routines (implemented or called in one of the project files) is propagated to the functions/procedures where that memory is used; this can let the compiler drop the conservative assumption that the data is not properly aligned (and in the absence of additional information the assumption has to be conservative). Another example is disambiguation, the analysis of data aliasing: lacking additional information and unable to prove that memory regions do not overlap, the compiler conservatively assumes that they do. Such a decision can hurt optimizations such as automatic vectorization on IA-32 or software pipelining (SWP) on IA-64. Interprocedural optimization can help the compiler determine whether memory regions actually overlap.
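A hedged two-file sketch of what -Qipo enables (the file and function names are made up for illustration): the small accessor defined in vec.c can be inlined into the loop in sum.c even though the two functions live in different translation units.

    /* vec.c */
    double vec_get(const double *v, int i)
    {
        return v[i];
    }

    /* sum.c */
    double vec_get(const double *v, int i);

    double sum(const double *v, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += vec_get(v, i);   /* with -Qipo this cross-file call can be inlined away */
        return s;
    }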

Profile-guided optimization: consists of three stages. 1) Generation of instrumented code with the -Qprof_gen option. 2) The resulting code is run on representative data, during which information is collected about various characteristics of the execution (for example, branch probabilities or the typical number of loop iterations). 3) Recompilation with the -Qprof_use option, which makes the compiler use the information collected in the previous step. The compiler is thus able to use not only static estimates of important program characteristics but also data obtained during actual execution of the program. This can help in the subsequent selection of optimizations (for example, a more efficient placement of the program's branches in memory based on how often each branch was executed, or applying optimizations to a loop based on the typical number of its iterations). Profile-guided optimization is especially useful when it is possible to select a small but representative data set (for step 2) that illustrates the most typical cases of the program's future use. In some subject areas choosing such a representative set is entirely possible; for example, profile-guided optimization is used by DBMS developers.
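A hedged sketch of the three steps on the Windows command line (the file names are illustrative; the Linux spellings of the options are -prof_gen and -prof_use):

    rem 1) build an instrumented executable
    icl -Qprof_gen app.c
    rem 2) run it on representative input to collect a profile
    app.exe typical_input.dat
    rem 3) rebuild, letting the compiler use the collected profile
    icl -Qprof_use app.c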

The optimizations listed above are of the generic kind, i.e. the generated code will work on all the different processors of the family (in the case of the 32-bit architecture, on all of the following processors: Intel Pentium III, Pentium 4 including the Prescott core, and Intel Pentium M). There are also optimizations for specific processors.

Processor-specific optimizations: -QxK (Pentium III; use of SSE instructions and microarchitecture features), -QxW and -QxN (Pentium 4; use of SSE and SSE2 instructions and microarchitecture features), -QxB (Pentium M; use of SSE and SSE2 instructions and microarchitecture features), -QxP (Prescott; use of SSE, SSE2 and SSE3 instructions and microarchitecture features). Code generated with such options may not work on other members of the processor line (for example, -QxW code may cause an invalid-instruction fault if executed on a system with an Intel Pentium III processor), or may not work with maximum efficiency (for example, -QxB code on a Pentium 4, because of differences in microarchitecture). With these options it is also possible to use runtime libraries optimized for a specific processor and its instruction set. To ensure that the code really is executed on the target processor, a dispatch mechanism (cpu-dispatch) is provided: a check of the processor during program execution. Depending on the situation, this mechanism may or may not be activated. Dispatch is always used when one of the -Qax{K,W,N,P} variants is specified; in that case two versions of the code are generated, one optimized for the specific processor and one generic, and the choice between them is made at run time. Thus, at the cost of a larger code size, the program runs on all processors of the line and runs optimally on the target processor. Another approach is to optimize the code for an earlier member of the line and use that code on it and on the later processors. For example, -QxN code can run on a Pentium 4 with either a Northwood or a Prescott core, with no increase in code size. With this approach you get good, though not optimal, performance on a Prescott system (since SSE3 is not used and the differences in microarchitecture are not taken into account), and optimal performance on Northwood. Similar options exist for IA-64 processors; at present there are two: -G1 (Itanium) and -G2 (Itanium 2; the default).
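For example, a hedged illustration of the dispatch variant (the file name is made up):

    rem Generates both a Prescott-optimized code path and a generic IA-32 path
    rem in one binary; the cpu-dispatch check selects between them at run time.
    icl -QaxP app.c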

The graph below (Figure 5) shows the speedup (normalized to 1, meaning no speedup) from using some of the optimizations listed above (namely -O3 -Qipo -Qprof_use -Qx{N,P}) on the Prescott processor compared with the default settings (-O2). Using -QxP helps in some cases to get a speedup over -QxN. The greatest speedup is achieved in the 168.wupwise application already mentioned in the previous section (thanks to intensive optimization of complex arithmetic with SSE3 instructions).


Figure 5


Figure 6 below shows how many times faster code built with the best settings runs compared with completely unoptimized code (-Od) on Pentium 4 and Itanium 2 processors. It can be seen that Itanium 2 depends much more strongly on the quality of optimization. This is especially pronounced for floating-point (FP) computations, where the ratio is approximately 36 times. Floating-point computation is a strength of the IA-64 architecture, but care must be taken to use the most efficient compiler settings; the resulting performance gain repays the effort of finding them.


Figure 6. Speedup with the best optimization settings on SPEC CPU2000


Intel compilers support the OpenMP industry specification for creating multi-threaded applications. Both explicit (the -Qopenmp option) and automatic (-Qparallel) parallelization modes are supported. In the explicit mode the programmer is responsible for the correct and efficient use of the OpenMP facilities. In the case of automatic parallelization, the compiler carries the additional burden of analyzing the program code; for this reason automatic parallelization currently works effectively only on fairly simple code.
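A minimal hedged sketch of the explicit mode (the function is illustrative; it is built with the -Qopenmp option named above): the loop iterations are divided among threads, so on a processor with Hyper-Threading technology both logical processors can be used.

    #include <omp.h>

    /* y[i] += a * x[i], with iterations distributed across OpenMP threads */
    void axpy(double *y, const double *x, double a, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }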

The graph in Figure 7 shows the speedup from using explicit parallelization on a pre-production sample system based on an Intel Pentium 4 (Prescott) processor with support for Hyper-Threading technology: 2.8 GHz, 2 GB RAM, 8 KB L1 cache, 512 KB L2 cache. The test suite used is SPEC OMPM2001. This suite is aimed at small and medium SMP systems; memory consumption is up to two gigabytes. The applications were compiled with Intel 8.0 C/C++ and Fortran using two sets of options, -Qopenmp -Qipo -O3 -QxN and -Qopenmp -Qipo -O3 -QxP, and each build was run with Hyper-Threading technology enabled and disabled. The speedup values in the graph are normalized to the performance of the single-threaded version with Hyper-Threading technology disabled.


Figure 7: SPEC OMPM2001 Applications on Prescott Processor


It can be seen that in 9 out of 11 cases explicit parallelization with OpenMP gives a performance increase when Hyper-Threading technology is enabled. One of the applications (312.swim) slows down; it is well known that this application depends heavily on memory bandwidth. Just as with SPEC CPU2000, the wupwise application benefits greatly from the Prescott-specific optimizations (-QxP).


1.2 Optimizing programs by making changes to the source text and using compiler diagnostics


The previous section looked at the influence of the compiler (and its settings) on the speed of code execution. At the same time, Intel compilers offer broader opportunities for code optimization than just changing settings. In particular, the compilers let the programmer put "hints" into the program code that allow more efficient code to be generated. Below are some examples for the C/C++ language (the Fortran language has similar facilities that differ only in syntax).

#pragma ivdep (ivdep stands for "ignore vector dependencies") is placed before a loop to tell the compiler that there are no data dependences inside it. This hint works when the compiler, based on its analysis, conservatively assumes that such dependences may exist (if the analysis can prove that a dependence does exist, the "hint" has no effect), while the author of the code knows that they cannot arise. With this hint the compiler can generate more efficient code: automatic vectorization for IA-32 (the use of vector instructions from the MMX/SSE/SSE2/SSE3 sets for loops in C/C++ and Fortran; the technique is described in more detail, for example, in an article in the Intel Technology Journal) or software pipelining (SWP) for IA-64.
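A hedged example (the function is illustrative): p and q may in principle point into the same array, so without the pragma the compiler conservatively assumes a dependence between iterations and may refuse to vectorize; the pragma states that the author guarantees no such dependence at the actual call sites.

    void add_offset(double *p, const double *q, int n)
    {
        #pragma ivdep
        for (int i = 0; i < n; ++i)
            p[i] = q[i] + 1.0;    /* vectorizable once the assumed dependence is ignored */
    }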

#pragma vector always is used to override the compiler's decision that vectorizing a loop would be inefficient (this applies both to automatic vectorization for IA-32 and to SWP for IA-64), a decision made on the basis of an analysis of the amount and kind of work performed in each iteration.

#pragma novector has the opposite effect of #pragma vector always.

#pragma vector aligned is used to tell the compiler that the data used in the loop is aligned on a 16-byte boundary. This allows more efficient and/or more compact code to be generated (thanks to the absence of run-time checks).
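A hedged example, assuming the caller really does pass 16-byte-aligned buffers (allocated, say, with _mm_malloc); otherwise the aligned loads and stores the compiler emits may fault:

    void scale_aligned(float *dst, const float *src, float s, int n)
    {
        #pragma vector aligned
        for (int i = 0; i < n; ++i)
            dst[i] = s * src[i];
    }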

#pragma vector unaligned has the opposite effect of #pragma vector aligned. It is difficult to speak of a performance gain in this case, but you can count on more compact code.

#pragma distribute point is placed inside a loop so that the compiler can split the loop (loop distribution) at that point into several smaller ones. For example, this "hint" can be used when the compiler fails to vectorize the original loop automatically (say, because of a data dependence that cannot be ignored even with #pragma ivdep), whereas each (or some) of the newly formed loops can be vectorized effectively.
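A hedged sketch (the function is illustrative; the pragma is written here as distribute_point, the spelling used in the compiler documentation, whereas the text above writes it with a space): the recurrence on a[] prevents vectorizing the whole loop, but after the compiler splits it at the marked point the part working on b[] can be vectorized.

    void split_loop(double *a, double *b, const double *c, int n)
    {
        for (int i = 1; i < n; ++i) {
            a[i] = a[i - 1] + c[i];     /* loop-carried dependence              */
            #pragma distribute_point
            b[i] = 2.0 * c[i];          /* independent iterations, vectorizable */
        }
    }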

#pragma loop count (N) is used to tell the compiler that the most likely number of iterations of the loop is N. This information helps it choose the most effective optimizations for the loop (for example, whether to unroll it, whether to apply SWP or automatic vectorization, whether to issue software data prefetch instructions, and so on).
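A hedged example (the function is illustrative; the pragma spelling follows the text): telling the compiler that the loop usually runs only about eight times lets it avoid transformations, such as aggressive unrolling or prefetching, that pay off only for long loops.

    /* n is typically around 8 at the real call sites */
    void small_copy(double *dst, const double *src, int n)
    {
        #pragma loop count (8)
        for (int i = 0; i < n; ++i)
            dst[i] = src[i];
    }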

The "hint" _assume_aligned(p, base) is used to tell the compiler that the memory region addressed by the pointer p is aligned on a boundary of base = 2^n bytes.
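A hedged example (the function is illustrative; the hint is written here as __assume_aligned, the two-underscore spelling used in the compiler documentation): the promise of 16-byte alignment lets the compiler use aligned vector accesses without run-time checks.

    void zero_buffer(double *p, int n)
    {
        __assume_aligned(p, 16);     /* p is promised to be 16-byte aligned */
        for (int i = 0; i < n; ++i)
            p[i] = 0.0;
    }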

This is not a complete list of the various compiler "hints" that can significantly affect the efficiency of the generated code. The natural question is how to determine that the compiler needs a hint.

First, you can use the compiler's diagnostics in the form of the reports it provides to the programmer. For example, with the -Qvec_reportN option (where N ranges from 0 to 3 and sets the level of detail) you can obtain a report on automatic vectorization. The programmer then sees which loops were vectorized and which were not; in the negative case the compiler states in the report the reasons why vectorization failed. Suppose the reason was a conservatively assumed data dependence; then, if the programmer is sure that no dependence can arise, #pragma ivdep can be used. The compiler provides similar capabilities on IA-64 (compared with -Qvec_reportN for IA-32) for monitoring the presence and effectiveness of SWP. In general, Intel compilers offer extensive facilities for diagnosing optimizations.
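For example, a hedged one-line invocation requesting the most detailed vectorization report (the file name is illustrative):

    icl -QxN -Qvec_report3 app.c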

Second, other software products (such as the Intel VTune profiler) can be used to find performance bottlenecks in the code. The results of the analysis can help the programmer make necessary changes.

You can also use the assembly code listing generated by the compiler for analysis.


Figure 8


Figure 8 above shows the step-by-step process of optimizing an application using the Intel Fortran compiler (and other software) for the IA-64 architecture. The example is the 48-hour non-adiabatic regional forecast scheme of the Russian Hydrometeorological Center (it is described, for example, in a separate article; that article quotes a calculation time of about 25 minutes, but significant changes have occurred since it was written). The performance of the code on a Cray Y-MP system is taken as the reference point. The unmodified code with default compiler options (-O2) showed a 20% performance increase on a four-processor system based on 900 MHz Intel Itanium 2 processors. Applying more aggressive optimization (-O3) gave a speedup of about 2.5 times without any code changes, mainly thanks to SWP and data prefetching. Analysis using the compiler diagnostics and the Intel VTune profiler revealed several bottlenecks. For example, the compiler had not software-pipelined several performance-critical loops, reporting that it assumed a data dependence; small code changes (the ivdep directive) helped achieve efficient pipelining. Using the VTune profiler we also discovered (and the compiler report confirmed) that the compiler had not changed the order of nested loops (loop interchange) for more efficient use of cache memory; the reason was again a conservative assumption about data dependence. Changes were made to the source code of the program, and as a result a 4-fold speedup over the initial version was achieved. After applying explicit parallelization with OpenMP directives and then moving to a system with higher-frequency processors, the calculation time was reduced to less than 8 minutes, more than 16 times faster than the initial version.

Intel Visual Fortran

Intel Visual Fortran 8.0 uses the CVF compiler's front end (the part of the compiler that converts the program text into an internal representation largely independent of both the programming language and the target machine) together with the Intel compiler components responsible for optimization and code generation.


Figure 9




Figure 10


Figures 9 and 10 compare the performance of Intel Visual Fortran 8.0 with the previous Intel Fortran 7.1 and with other compilers for this language that are popular in the industry, running under the Windows and Linux operating systems. The comparison uses benchmarks whose source code, conforming to the F77 and F90 standards, is available at http://www.polyhedron.com/. More detailed compiler performance comparisons are available on the same site (Win32 Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks and Linux Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks): more compilers are shown, and the geometric mean is given together with the individual results of each test.

Examples of real hacks: Intel C++ 7.0 Compiler - Archive WASM.RU

...the Intel C++ 7.0 compiler finished downloading late at night, at about five in the morning. I really wanted to sleep, but I was also torn by curiosity: had the protection been strengthened or not? Deciding that I would not sleep anyway until I had figured out the protection, I opened a new console, reset the TEMP and TMP environment variables to point to the C:\TEMP directory, and hastily typed the indecently long installer name W_CC_P_7.0.073.exe on the command line (the need to set the TEMP and TMP variables is explained by the fact that in Windows 2000 they point by default to a very deeply nested directory, and the Intel C++ installer - and not only it - does not support paths of such enormous length).

It immediately became clear that the protection policy had been radically revised and now the presence of a license was checked already at the stage of installing the program (in version 5.x the installation was carried out without problems). OK, we give the dir command and look at the contents of what we now have to fight with:

    Contents of the folder C:\TMP\IntelC++Compiler70

    17.03.2003  05:10    <DIR>          html
    17.03.2003  05:11    <DIR>          x86
    17.03.2003  05:11    <DIR>          Itanium
    17.03.2003  05:11    <DIR>          notes
    05.06.2002  10:35            45 056 AutoRun.exe
    10.07.2001  12:56                27 autorun.inf
    29.10.2002  11:25             2 831 ccompindex.htm
    24.10.2002  08:12           126 976 ChkLic.dll
    18.10.2002  22:37           552 960 chklic.exe
    17.10.2002  16:29            28 663 CLicense.rtf
    17.10.2002  16:35               386 credist.txt
    16.10.2002  17:02            34 136 Crellnotes.htm
    19.03.2002  14:28             4 635 PLSuite.htm
    21.02.2002  12:39             2 478 register.htm
    02.10.2002  14:51            40 960 Setup.exe
    02.10.2002  10:40               151 Setup.ini
    10.07.2001  12:56               184 setup.mwg

               19 files      2,519,238 bytes
                6 folders  886,571,008 bytes free

Aha! The setup.exe installation program takes only forty-odd kilobytes. Very good! It is unlikely that serious protection can be hidden in such a volume, and even if it is, such a tiny file costs nothing to analyze in its entirety, down to the last byte of the disassembled listing. However, it is not a given that the protection code lives in setup.exe itself; it could sit somewhere else, for example in... ChkLic.dll/ChkLic.exe, which together occupy a little less than seven hundred kilobytes. Wait, what is ChkLic? Is that short for Check License or what?! Hmm, the folks at Intel clearly have problems with a sense of humor; they might as well have honestly named the file "Hack Me". Okay, judging by the size, ChkLic is that same FLEXlm, which we have already encountered (see "Intel C++ 5.0 Compiler") and have a rough idea of how to break.

We give the command "dumpbin /EXPORTS ChkLic.dll" to examine the exported functions and... hold on tight to your keyboard so as not to fall off the chair:

    Dump of file ChkLic.dll

    Section contains the following exports for ChkLic.dll

             0 characteristics
      3DB438B4 time date stamp Mon Oct 21 21:26:12 2002
             1 number of functions
             1 number of names

      ordinal  hint  RVA       name
            1     0  000010A0  _CheckValidLicense

Damn it! The protection exports just one single function with the wonderful name CheckValidLicense. "Wonderful" because the purpose of the function is clear from its name, and painstaking analysis of the disassembled code becomes unnecessary. Well, they've spoiled all the fun... they could at least have exported it by ordinal, or christened it with some intimidating name like DESDecrypt.

...enough daydreaming! Okay, back to our sheep. Let's think logically: if all the protection code is concentrated directly in ChkLic.dll (and, judging by the "bolted-on" nature of the protection, that is indeed the case), then the whole "protection" comes down to setup.exe calling CheckValidLicense and checking the value it returns. Therefore, to "crack" it, it is enough to patch ChkLic.dll so that the CheckValidLicense function always returns... and, by the way, what should it return? More precisely: which return value corresponds to a successful license check? No, don't rush to disassemble setup.exe to find out, because there are not that many possible options: either FALSE or TRUE. Are you betting on TRUE? Well, in a sense that is logical, but on the other hand, why did we decide that CheckValidLicense returns a success flag rather than an error code? After all, the installer must somehow report the reason for refusing to install the compiler: license file not found, file corrupted, license expired, and so on. Okay, let's try returning zero, and if that does not work, we'll return one.

OK, buckle up, let's go! We launch HIEW, open the ChkLic.dll file (if it refuses to open, curse a couple of times and temporarily copy it to the root or to any other directory whose name contains no special characters that HIEW dislikes). Then, turning again to the export table obtained with dumpbin, we find the address of the CheckValidLicense function (010A0h in this case) and jump to its beginning via "10A0". Now we patch it live, overwriting the old code with "XOR EAX, EAX / RETN 4". Why "RETN 4" and not just "RET"? Because the function uses the stdcall convention, which you can establish by looking at its epilogue in HIEW (just scroll down the disassembler window until you meet RET).

Let's check... It works!!! Despite the lack of a license, the installer begins the installation without asking any questions. So the defense has fallen. We can hardly believe that everything is so simple and, so as not to sit staring stupidly at the monitor while the installation completes, we set our favorite IDA disassembler loose on setup.exe. The first thing that catches the eye is the absence of CheckValidLicense from the list of imported functions. Maybe it launches the ChkLic.exe file somehow? We try to find the corresponding reference among the automatically recognized strings (View -> Names): "ChkLic"... the string "ChkLic.exe" is not there at all, but "ChkLic.dll" is found. I see - that means the ChkLic library is loaded explicitly via LoadLibrary. Following the cross-reference confirms this:

    .text:0040175D     push    offset aChklic_dll ; lpLibFileName
    .text:00401762     call    ds:LoadLibraryA
    .text:00401762     ; load ChkLic.dll ^^^^^^^^^^^^^^^^^
    .text:00401762     ;
    .text:00401768     mov     esi, eax
    .text:0040176A     push    offset a_checkvalidlic ; lpProcName
    .text:0040176F     push    esi             ; hModule
    .text:00401770     call    ds:GetProcAddress
    .text:00401770     ; get the address of the CheckValidLicense function
    .text:00401770     ;
    .text:00401776     cmp     esi, ebx
    .text:00401778     jz      loc_40192E
    .text:00401778     ; if there is no such library, exit the installer
    .text:00401778     ;
    .text:0040177E     cmp     eax, ebx
    .text:00401780     jz      loc_40192E
    .text:00401780     ; if the library has no such function, exit the installer
    .text:00401780     ;
    .text:00401786     push    ebx
    .text:00401787     call    eax
    .text:00401787     ; call the CheckValidLicense function
    .text:00401787     ;
    .text:00401789     test    eax, eax
    .text:0040178B     jnz     loc_4019A3
    .text:0040178B     ; if the function returned non-zero, exit the installer

Incredibly, this terribly primitive protection is built exactly like that! Moreover, the half-megabyte file ChkLic.exe is not needed at all - so why was it worth dragging it off the Internet? By the way, if you decide to keep the compiler distribution (note: I did not say "distribute it"!), then to save disk space ChkLic.* can be erased: either by patching setup.exe so that it never calls them, or simply by creating your own ChkLic.dll that exports a stdcall function CheckValidLicense of the form: int CheckValidLicense(int some_flag) { return 0; }
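A hedged sketch of such a stub (the parameter name and export details are illustrative; the article only states that the DLL must export a stdcall CheckValidLicense that returns 0, and an export alias in a .def file may be needed so the exported name matches exactly what setup.exe looks up with GetProcAddress):

    /* chklic_stub.c - build as a DLL and drop it in place of ChkLic.dll */
    __declspec(dllexport) int __stdcall CheckValidLicense(int some_flag)
    {
        (void)some_flag;   /* the argument is ignored                       */
        return 0;          /* "license OK" as far as setup.exe is concerned */
    }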

Well, while we were discussing all this, the installer finished installing the compiler and successfully completed its work. It will be interesting to see whether the compiler starts, or whether the fun is only beginning. We feverishly descend the branched hierarchy of subfolders, find icl.exe (which, as one would expect, lives in the bin directory), launch it and... the compiler naturally does not start, complaining "icl: error: could not checkout FLEXlm license", without which it cannot continue its work.

So Intel used multi-level protection, and the first level turned out to be a crude fool-proofing layer. Very well! We accept the challenge and, based on previous experience, automatically look for an LMGR*.DLL file in the compiler directory. No luck! This time there is no such file, but icl.exe has gained a lot of weight, passing the six-hundred-kilobyte mark... Stop! Could the compiler developers have linked that same FLEXlm statically? Let's see: in Intel C++ 5.0 the sizes of lmgr327.dll and icl.exe added up to 598 KB, and now icl.exe alone occupies 684 KB. Allowing for natural age-related "weight gain", the figures agree very well. So it is FLEXlm after all! Oh-oh! Without symbolic function names it will be much harder to break the protection... But let's not panic ahead of time; let's think calmly. It is unlikely that the development team completely rewrote all the code that interacts with this protection "envelope". Most likely, the "improvement" amounted to nothing more than a change in the type of linking. And if so, the chances of cracking the program are still good!

Remembering that last time the protection code was in the main function, we determine its address, set a breakpoint there and, when the debugger pops up, simply trace through the code, glancing alternately at the debugger and at the program's output window: has the expletive message appeared yet? At the same time we note every conditional jump we encounter on a separate sheet of paper (or commit them to memory, if you prefer), not forgetting to record whether each jump was taken or not... Stop! While we were chatting, the abusive message has already popped up! Very well, let's see which conditional jump it corresponds to. Our notes show that the last jump encountered was the JNZ at address 0401075h, "reacting" to the result returned by sub_404C0E:

    .text:0040107F loc_40107F:             ; CODE XREF: _main+75^j
    .text:0040107F     mov     eax, offset aFfrps ; "FFrps"
    .text:00401084     mov     edx, 21h
    .text:00401089     call    sub_404C0E
    .text:0040108E     test    eax, eax
    .text:00401090     jnz     short loc_40109A

Obviously, sub_404C0E is the very protection routine that checks for the presence of a license. How do we fool it? There are plenty of options... First, you can thoughtfully and scrupulously analyze the contents of sub_404C0E to find out exactly what it checks and how. Second, you can simply replace JNZ short loc_40109A with JZ short loc_40109A, or even with NOP, NOP. Third, the TEST EAX, EAX instruction that checks the return value can be turned into an instruction that zeroes it: XOR EAX, EAX. Fourth, you can patch sub_404C0E itself so that it always returns zero. I don't know about you, but I liked option number three best. We change two bytes and launch the compiler. If the protection contains no other checks of "licensing", the program will work, and vice versa. (As we remember, in the fifth version there were two such checks.) Amazingly, the compiler no longer complains and works!!! Indeed, as one might have expected, the developers did not strengthen the protection at all - on the contrary, they even weakened it.

Chris Kaspersky


Intel C++ Compiler

    Main features:

    • Vectorization for SSE, SSE2, SSE3, SSE4

    The compiler supports the OpenMP 3.0 standard for writing parallel programs. Also contains a modification of OpenMP called Cluster OpenMP, with which you can run applications written in accordance with OpenMP on clusters using MPI.

    Intel C++ Compiler uses the frontend (the part of the compiler that parses the compiled program) from Edison Design Group. The same frontend is used by the SGI MIPSpro, Comeau C++, and Portland Group compilers.

    This compiler is widely used for compiling SPEC CPU benchmarks.

    There are 4 series of products from Intel containing the compiler:

    • Intel C++ Compiler Professional Edition
    • Intel Cluster Toolkit (Compiler Edition)

    The disadvantages of the Linux version of the compiler include partial incompatibility with GNU extensions of the C language (supported by the GCC compiler), which can cause problems when compiling some programs.

    Experimental versions

    The following experimental versions of the compiler were published:

    • Intel STM Compiler Prototype Edition dated September 17, 2007. Software Transactional Memory (STM) support. Released for Linux and Windows, only for IA-32 (x86 processors);
    • Intel Concurrent Collections for C/C++ 0.3 from September 2008. Contains mechanisms that make it easier to write parallel C++ programs.

    Basic flags

    Windows       Linux, Mac OS X   Description
    /Od           -O0               Disable optimizations
    /O1           -O1               Optimize to minimize executable file size
    /O2           -O2               Optimize for speed; some optimizations enabled
    /O3           -O3               Enable all optimizations from O2, plus intensive loop optimizations
    /Qip          -ip               Enable single-file interprocedural optimization
    /Qipo         -ipo              Enable global (multi-file) interprocedural optimization
    /QxO          -xO               Allow use of the SSE3, SSE2 and SSE extensions on processors from any manufacturer
    /fast         -fast             "Fast mode". Equivalent to "/O3 /Qipo /QxHost /no-prec-div" on Windows and "-O3 -ipo -static -xHOST -no-prec-div" on Linux. Note that the "-xHOST" flag means optimization for the processor on which the compiler is running.
    /Qprof-gen    -prof_gen         Create an instrumented version of the program that will collect a performance profile
    /Qprof-use    -prof_use         Use the profile information collected from runs of the program built with prof_gen

