Hello, dear readers of the blog. Today we will talk about where gibberish characters (the notorious "krakozyabry") come from on websites and in programs, which text encodings exist and which of them should be used. We will take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8.

To some, this information may seem unnecessary, but you would be surprised how many questions I receive specifically about gibberish (an unreadable set of characters) creeping out. Now I will be able to refer everyone to this article and, along the way, find my own mistakes. Well, get ready to absorb the information and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

Text encodings developed together with the IT industry itself, and over this time they have undergone quite a lot of changes. Historically, it all started with EBCDIC, a name that sounds rather awkward when pronounced in Russian. It made it possible to encode letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

But the starting point for the development of modern text encodings should still be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski" in Russian). It describes the 128 characters most frequently used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described in ASCII also included some service characters like brackets, hash marks, asterisks, etc. In fact, you can see them yourself:

It is these 128 characters from the original version of ASCII that have become the standard, and in any other encoding you will definitely find them and they will appear in this order.

But with one byte of information you can encode not 128 but as many as 256 different values (two to the power of eight equals 256), so after the basic version a whole series of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode the characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in the description. First of all, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone took it at an institute or school). One byte consists of eight bits, each of which corresponds to a power of two, starting from two to the zeroth power and up to two to the seventh:

It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a structure. Converting a number from binary to decimal is quite simple: you just need to add up all the powers of two that have a one above them.

In our example, this turns out to be 1 (two to the power of zero) plus 8 (two to the power of three), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh). The total is 233 in the decimal number system. As you can see, everything is very simple.
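
For those who would rather check this arithmetic in code, here is a minimal Python sketch. The helper name `binary_to_decimal` is my own and exists only for illustration; it simply adds up the powers of two that have a one above them.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum the powers of two for every position that holds a '1'."""
    total = 0
    # The rightmost bit stands for 2**0, so walk the string from the right.
    for power, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** power
    return total


value = binary_to_decimal("11101001")
print(value)  # 233
```

Python's built-in `int("11101001", 2)` does the same thing; the loop is spelled out only to mirror the explanation.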

But if you look closely at the table of ASCII characters, you will see that their codes are given in hexadecimal. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. As you probably know, the hexadecimal system uses, in addition to Arabic numerals, Latin letters from A (meaning ten) to F (meaning fifteen).

To convert a binary number to hexadecimal, the following simple and obvious method is used. Each byte of information is divided into two halves of four bits, as shown in the screenshot above. Each half byte can hold only sixteen values (two to the fourth power) in binary, and each of them is easily written as a single hexadecimal digit.

Note that in the left half of the byte the powers must be counted again starting from zero, and not as shown in the screenshot. As a result, simple calculations show that the number encoded in the screenshot is E9. I hope my reasoning and the solution to this puzzle were clear to you. Now let's get back to text encodings proper.
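
The half-byte trick can be sketched in a few lines of Python. The function name `byte_to_hex` is hypothetical; the sketch assumes the input is exactly eight characters of zeros and ones.

```python
def byte_to_hex(bits: str) -> str:
    """Convert an 8-bit string to two hex digits, one per half byte."""
    digits = "0123456789ABCDEF"
    high, low = bits[:4], bits[4:]
    # Each half is read as its own small binary number in the range 0..15,
    # which is exactly one hexadecimal digit.
    return digits[int(high, 2)] + digits[int(low, 2)]


print(byte_to_hex("11101001"))  # E9
print(byte_to_hex("00101010"))  # 2A, the asterisk
```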

Extended versions of ASCII - CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, so to speak, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially, it contained only 128 characters: the Latin alphabet, Arabic numerals and a few other things. In the extended versions it became possible to use all 256 values that can be encoded in one byte of information. That is, it became possible to add the letters of your own language to ASCII.

Here we need to digress again to explain why text encodings are needed at all and why they are so important. The characters on your computer screen are formed from two things: sets of vector shapes (representations) of various characters, which are stored in font files, and a code that lets the program pull out of this set of vector shapes (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts themselves are responsible for the vector shapes, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (text editor, browser, etc.), when parsing the code, reads the encoding of the next character and looks for the corresponding vector form in the required font file, which is connected to display this text document. Everything is simple and banal.

This means that to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must be present in the font used, and the character must be encodable in one byte in an extended ASCII encoding. That is why there is a whole bunch of such variants. For Russian characters alone there are several varieties of extended ASCII.

For example, CP866 appeared first; it was an extended version of ASCII that made it possible to use the characters of the Russian alphabet.

That is, its upper part completely coincided with the basic version of ASCII (128 Latin characters, numbers and other things), which is shown in the screenshot just above, while the lower part of the CP866 table looked as shown in the screenshot just below and made it possible to encode another 128 characters (Russian letters and all sorts of pseudographics):

You see, in the right column the numbers start with 8, because the numbers from 0 to 7 belong to the basic part of ASCII (see the first screenshot). Thus, the Russian letter "М" in CP866 has the code 8C (it sits at the intersection of the row numbered 8 and the column numbered C in hexadecimal), which fits in one byte of information, and if a suitable font with Russian characters is available, this letter will appear in the text without any problems.
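
If you do not have the table at hand, the codes can be looked up with Python's standard `cp866` codec. This is only a sketch: the helper `cp866_codes` and the sample word are my own choices, not part of any standard.

```python
def cp866_codes(text: str) -> list:
    """Return the CP866 code of every character as two hex digits."""
    return [format(b, "02X") for b in text.encode("cp866")]


word = "Мир"  # Cyrillic letters, not Latin look-alikes
for ch, code in zip(word, cp866_codes(word)):
    # Every Russian letter lands in the upper half of the table (80..FF).
    print(ch, code)

# Latin letters keep their basic ASCII codes in CP866.
print(cp866_codes("M"))
```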

Where did so much pseudographics in CP866 come from? The point is that this encoding for Russian text was developed back in those ancient years when graphical operating systems were not nearly as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the design of texts, which is why CP866 and all its peers among the extended versions of ASCII abound in it.

CP866 was distributed by IBM, but a number of other encodings were also developed for Russian characters. For example, KOI8-R belongs to the same type (extended ASCII):

It works on the same principle as the CP866 described a little earlier: each character of text is encoded by one single byte. The screenshot shows the second half of the KOI8-R table, because the first half fully matches basic ASCII, which is shown in the first screenshot in this article.

Among the features of the KOI8-R encoding, it can be noted that Russian letters in its table do not go in alphabetical order, as, for example, they did in CP866.

If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters occupy the same cells of the table as the similar-sounding letters of the Latin alphabet in the first half of the table. This was done so that Russian text could be turned into Latin characters by discarding just one bit (two to the seventh power, or 128).
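
To see this property at work, here is a small Python sketch that encodes a phrase in KOI8-R and clears the highest bit of every byte. The function name `strip_high_bit` is hypothetical, and the codec name `koi8_r` is the one shipped with Python's standard library.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every byte, read it as ASCII."""
    raw = text.encode("koi8_r")
    # 0x7F is 01111111 in binary, so the AND drops only the top bit (128).
    stripped = bytes(b & 0x7F for b in raw)
    return stripped.decode("ascii")


print(strip_high_bit("Русский Текст"))  # rUSSKIJ tEKST
```

The letter case comes out inverted, but the text stays readable as a transliteration, which is the whole point of this layout.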

Windows 1251 - the modern version of ASCII and why gibberish appears

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity, and the need for pseudographics in them disappeared over time. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one character of text is encoded with just one byte of information), but without pseudographic symbols.

They belonged to the so-called ANSI encodings, named after the American National Standards Institute. In common parlance, the name "Cyrillic" was also used for the version with Russian language support. An example of this is Windows 1251.

It compared favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (except for the accent mark), as well as by symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of this abundance of encodings for the Russian language, font makers and software makers constantly had headaches, and you and I, dear readers, often got that same notorious gibberish whenever there was confusion about which encoding a text used.

It appeared very often when sending and receiving e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem fundamentally. Users often resorted to transliteration in Latin letters for correspondence to avoid the notorious gibberish that came with Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the gibberish appearing instead of Russian text was the result of decoding the text with an encoding that did not match the one in which the message was originally encoded.

For example, if you try to display characters encoded in CP866 using the Windows 1251 code table, you will get exactly this gibberish (a meaningless set of characters) completely replacing the text of the message.
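
This mix-up is easy to reproduce. The sketch below uses Python's standard `cp866` and `cp1251` codecs; the sample word is my own choice.

```python
original = "Привет"

raw = original.encode("cp866")   # the bytes as the sender wrote them
garbled = raw.decode("cp1251")   # the reader assumes the wrong code table
print(garbled)                   # gibberish of the same length

# No bytes were lost, so repeating the mistake in reverse recovers the text.
repaired = garbled.encode("cp1251").decode("cp866")
print(repaired)  # Привет
```

The repair works only as long as the garbled text has not been re-saved with characters replaced or dropped along the way.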

A similar situation very often arises on forums or blogs, when text with Russian characters is mistakenly saved in an encoding other than the one the site uses by default, or in a text editor that adds extras to the code that are invisible to the naked eye.

In the end, many people got tired of this situation with a multitude of encodings and gibberish constantly creeping out, and the prerequisites appeared for creating a new universal encoding that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - universal encodings UTF 8, 16 and 32

Those thousands of characters of the Southeast Asian language group could not possibly be described in the one byte of information allocated for encoding characters in the extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the cooperation of many IT industry leaders (those who produce software, those who make hardware, those who create fonts), all of whom were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the name of the encoding is the number of bits used to encode one character. 32 bits equal 4 bytes of information, which are needed to encode one single character in the new universal UTF encoding.

As a result, the same text file will be four times larger (heavier) when encoded in UTF-32 than in an extended version of ASCII. This is bad, but in return UTF makes it possible to encode a number of characters equal to two to the thirty-second power (billions of characters, which covers any realistic need with a colossal margin).

But many countries with languages of the European group had no need for such a huge number of characters at all, while using UTF-32 would have given them a fourfold increase in the weight of text documents for nothing and, as a result, an increase in Internet traffic and in the amount of stored data. That is a lot, and no one could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared, which turned out to be so successful that it was adopted by default as the base space for all the characters we use. It uses two bytes to encode one character. Let's see what this looks like.

In the Windows operating system you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set under "Additional options", you can see, for each font separately, the entire range of characters included in it.

By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and this is the number that was adopted as the base space in Unicode. In addition, UTF-16 has a way to encode characters beyond this range using pairs of two-byte codes, which extends the space by about a million additional characters.
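
A short Python sketch makes the difference between the base space and the extended space visible. The helper `utf16_hex` is a name I made up; it uses the standard `utf-16-be` codec, which writes the bytes in direct order and adds no signature.

```python
def utf16_hex(ch: str) -> str:
    """Big-endian UTF-16 bytes of one character as hex, without a BOM."""
    return ch.encode("utf-16-be").hex().upper()


for ch in ("A", "М", "😀"):
    code = utf16_hex(ch)
    # Characters from the base space take 2 bytes; the emoji needs a pair.
    print(f"U+{ord(ch):04X} -> {code} ({len(code) // 2} bytes)")
```

For the two letters the four hex digits are simply the character number, while the emoji, which lies outside the base space, is written as two two-byte codes.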

But even this successful version of Unicode did not bring much satisfaction to those who wrote, say, programs only in English, because for them the switch from an extended version of ASCII to UTF-16 doubled the weight of documents (one byte per character in ASCII and two bytes for the same character in UTF-16).

It was precisely to satisfy everyone that the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in its name, it really does have a variable length: in the original design, each character of text could be encoded as a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because four bytes are already enough to cover the entire Unicode code space. All Latin characters are encoded in one byte, just like in good old ASCII.

What is noteworthy is that if only the Latin alphabet is encoded, even programs that do not understand Unicode will still read what is encoded in UTF-8. That is, the basic part of ASCII was simply carried over into this creation of the Unicode Consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and Georgian characters, for example, in three. By creating UTF-16 and UTF-8, the Unicode Consortium solved the main problem: fonts now have a single code space. All that remains for font makers is to fill it with vector shapes of text characters according to their strengths and capabilities. Now they even come in sets.

In the "Character Map" mentioned above you can see that different fonts support different numbers of characters. Some Unicode-rich fonts can be quite heavy. But now they differ not in that they were created for different encodings, but in how completely the font maker has filled the single code space with vector shapes.
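
The byte counts mentioned above are easy to check yourself. The sketch below compares the three encoding forms using Python's standard codecs; the helper `byte_lengths` and the sample characters are my own illustration.

```python
def byte_lengths(ch: str) -> dict:
    """Number of bytes one character takes in each Unicode encoding form."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}


samples = {"Latin": "A", "Cyrillic": "М", "Georgian": "ა", "Emoji": "😀"}
for label, ch in samples.items():
    # UTF-8 grows from 1 to 4 bytes, UTF-32 always spends 4.
    print(label, byte_lengths(ch))
```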

Gibberish instead of Russian letters - how to fix it

Let's now see how gibberish appears instead of text or, in other words, how the correct encoding for Russian text is chosen. In fact, it is set in the program in which you create or edit the text, or the code that contains text fragments.

To edit and create text files, I personally use Notepad++, a very good editor in my opinion. It can highlight the syntax of a good hundred programming and markup languages, and it can also be extended with plugins.

In the top menu of Notepad++ there is an item “Encodings”, where you will have the opportunity to convert an existing option to the one used by default on your site:

In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, you should choose the option UTF-8 without BOM to avoid gibberish. What is the BOM prefix?

The fact is that when the UTF-16 encoding was being developed, for some reason it was decided to allow the character code to be written both in direct byte order (for example, 0A15) and in reverse (150A). So that programs could understand in which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented: two extra bytes added to the very beginning of a UTF-16 document (in UTF-8 the same mark takes three bytes).

In UTF-8 the byte order is always the same, so a BOM is not needed there, and adding the signature (those notorious three extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF-8, always choose the option without BOM (without signature). This way you protect yourself in advance from gibberish creeping out.

What is noteworthy is that some Windows programs cannot do this (cannot save text in UTF-8 without a BOM), for example the notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) to the beginning. Moreover, these bytes are always the same, since the byte order never changes. But on servers this little thing can cause a problem: gibberish will come out.

Therefore, under no circumstances use the regular Windows Notepad to edit documents on your site if you do not want gibberish to appear. I consider the already mentioned Notepad++ editor to be the best and simplest option; it has practically no drawbacks.
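
If a file with a signature has already reached you, the three bytes can be detected and cut off programmatically. Below is a minimal Python sketch: `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library, while the helper `strip_utf8_bom` and the sample text are my own.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 signature, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# "utf-8-sig" writes the signature the way the file described above has it.
with_bom = "Привет".encode("utf-8-sig")
print(with_bom[:3].hex().upper())  # EFBBBF

clean = strip_utf8_bom(with_bom)
print(clean.decode("utf-8"))  # Привет
```

When only reading is needed, decoding with `utf-8-sig` is enough: it silently skips the signature if it is present.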

In Notepad++, when you choose an encoding, you will also be able to convert text to UCS-2, which is very close in nature to UTF-16. Notepad++ can also encode text in ANSI, which for the Russian language means Windows 1251, already described above. Where does this information come from?

It is registered in the registry of your Windows operating system - which encoding to choose in the case of ANSI, which to choose in the case of OEM (for the Russian language it will be CP866). If you install another default language on your computer, then these encodings will be replaced with similar ones from the ANSI or OEM category for that same language.

After you save the document in Notepad++ in the encoding you need or open the document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid gibberish, in addition to the steps described above, it is useful to state this encoding in the header of the source code of every page of the site, so that there is no confusion on the server or the local host.

In general, all hypertext markup languages other than HTML use a special XML declaration that specifies the text encoding.

Before it starts parsing the code, the browser learns which encoding is being used and how exactly it should interpret the character codes. But what is noteworthy is that if you save the document in the default Unicode encoding, this XML declaration can be omitted (the encoding will be considered UTF-8 if there is no BOM, or UTF-16 if there is a BOM).

In an HTML document, the encoding is specified with the Meta element, which is written between the opening and closing Head tags:

... ...

This entry differs quite a lot from the one adopted in the older HTML standard, but it fully conforms to the new HTML 5 standard that is gradually being introduced, and any browser in use at the moment will understand it correctly.

In theory, it is better to place the Meta element that specifies the encoding as high as possible in the document header, so that by the time the browser meets the first character outside basic ASCII (whose characters are always read correctly in any variation), it already knows how to interpret the codes of these characters.
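
To illustrate why the position of the declaration matters, here is a rough Python sketch of how a program might look for it before decoding the page. Everything in it is an assumption made for the example: the helper `sniff_meta_charset`, the regular expression, the 1024-byte limit and the sample markup. Real browsers follow a much more detailed algorithm.

```python
import re

# The tag and attribute names are plain ASCII, so they can be searched for
# in the raw bytes before the page encoding is known.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)


def sniff_meta_charset(html: bytes, default: str = "utf-8") -> str:
    """Look for a charset declaration near the top of the document."""
    match = META_CHARSET.search(html[:1024])  # only the beginning is scanned
    return match.group(1).decode("ascii").lower() if match else default


# Hypothetical page saved in Windows 1251 with a matching declaration.
page = '<html><head><meta charset="windows-1251"><title>Привет</title></head></html>'.encode("cp1251")
encoding = sniff_meta_charset(page)
print(encoding)
print(page.decode(encoding))
```

If the declaration sat below the scanned part of the file, the sketch would fall back to the default and the title would come out as gibberish.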

Good luck to you! See you soon on the pages of the blog site

Each computer has its own set of characters that it implements. This set contains the 26 uppercase and 26 lowercase letters, digits and special symbols (dot, space, etc.). The integers that the symbols are converted to are called codes. Standards were developed so that computers would have the same sets of codes.

ASCII standard

ASCII (American Standard Code for Information Interchange) is the American standard code for information exchange. Each ASCII character has 7 bits, so the maximum number of characters is 128 (Table 1). Codes 0 through 1F are control characters that are not printed. Many non-printable ASCII characters were intended for data transmission. For example, a message may consist of the start-of-header character SOH, the header itself, the start-of-text character STX, the text itself, the end-of-text character ETX and the end-of-transmission character EOT. However, data over a network is transmitted in packets, which themselves mark the beginning and end of a transmission, so the non-printable characters are almost never used.

Table 1 - ASCII code table

Number  Command  Meaning                 Number  Command  Meaning
0       NUL      Null character          10      DLE      Data link escape
1       SOH      Start of heading        11      DC1      Device control 1
2       STX      Start of text           12      DC2      Device control 2
3       ETX      End of text             13      DC3      Device control 3
4       EOT      End of transmission     14      DC4      Device control 4
5       ENQ      Enquiry                 15      NAK      Negative acknowledgement
6       ACK      Acknowledgement         16      SYN      Synchronous idle
7       BEL      Bell                    17      ETB      End of transmission block
8       BS       Backspace               18      CAN      Cancel
9       HT       Horizontal tab          19      EM       End of medium
A       LF       Line feed               1A      SUB      Substitute
B       VT       Vertical tab            1B      ESC      Escape
C       FF       Form feed               1C      FS       File separator
D       CR       Carriage return         1D      GS       Group separator
E       SO       Shift out               1E      RS       Record separator
F       SI       Shift in                1F      US       Unit separator

Number  Symbol  Number  Symbol  Number  Symbol  Number  Symbol  Number  Symbol  Number  Symbol
20      space   30      0       40      @       50      P       60      `       70      p
21      !       31      1       41      A       51      Q       61      a       71      q
22      "       32      2       42      B       52      R       62      b       72      r
23      #       33      3       43      C       53      S       63      c       73      s
24      $       34      4       44      D       54      T       64      d       74      t
25      %       35      5       45      E       55      U       65      e       75      u
26      &       36      6       46      F       56      V       66      f       76      v
27      '       37      7       47      G       57      W       67      g       77      w
28      (       38      8       48      H       58      X       68      h       78      x
29      )       39      9       49      I       59      Y       69      i       79      y
2A      *       3A      :       4A      J       5A      Z       6A      j       7A      z
2B      +       3B      ;       4B      K       5B      [       6B      k       7B      {
2C      ,       3C      <       4C      L       5C      \       6C      l       7C      |
2D      -       3D      =       4D      M       5D      ]       6D      m       7D      }
2E      .       3E      >       4E      N       5E      ^       6E      n       7E      ~
2F      /       3F      ?       4F      O       5F      _       6F      o       7F      DEL

Unicode standard

The previous encoding works fine for English, but it is not convenient for other languages. For example, German has umlauts, and French has accents. Some languages have completely different alphabets. The first attempt at extending ASCII was IS 646, which extended the previous encoding by an additional 128 characters: Latin letters with strokes and diacritics were added, and the result was named Latin-1. The next attempt was IS 8859, which introduced the concept of a code page. There were other attempts at extensions too, but none of them was universal. So the UNICODE encoding (IS 10646) was created. The idea behind it is to assign each character a single constant 16-bit value, called a code point. In total there are 65536 code points. To save space, Latin-1 was used for codes 0 - 255, which makes it easy to convert ASCII to UNICODE. This standard solved many problems, but not all. New characters keep arriving: for example, the Japanese language requires about 20 thousand more. Braille also needs to be included.

Unicode is a character encoding standard. Simply put, it is a table of correspondence between text characters (digits, letters, punctuation marks) and binary codes. The computer only understands sequences of zeros and ones. For it to know what exactly it should display on the screen, each symbol must be assigned its own unique number. In the eighties, characters were encoded in one byte, that is, eight bits (each bit is a 0 or a 1). Thus, one table (also called an encoding or a set) could accommodate only 256 characters. This may not be enough even for one language. Therefore, many different encodings appeared, and confusion between them often led to strange gibberish appearing on the screen instead of readable text. A single standard was required, and Unicode became that standard. The most widely used encoding is UTF-8 (Unicode Transformation Format), which uses 1 to 4 bytes to represent a character.


Characters in Unicode tables are numbered with hexadecimal numbers. For example, the Cyrillic capital letter М is designated U+041C. This means that it stands at the intersection of row 041 and column C. You can simply copy it and then paste it somewhere. To avoid rummaging through a list many kilometers long, use the search. On a symbol's page you will see its Unicode number and how it is drawn in different fonts. You can also enter the sign itself into the search bar, even if a square is drawn in its place, at least to find out what it was. The site also has special (and random) sets of icons of the same type, collected from different sections for ease of use.

The Unicode standard is international. It includes characters from almost all the scripts of the world, including those that are no longer used: Egyptian hieroglyphs, Germanic runes, Mayan writing, cuneiform and the alphabets of ancient states. Designations of weights and measures, musical notation and mathematical symbols are also represented.
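
The same lookup can be done without any table. A minimal Python sketch using only the built-in `ord` and `chr` functions and the standard `unicodedata` module:

```python
import unicodedata

ch = "М"  # Cyrillic capital letter, not the Latin M

print(f"U+{ord(ch):04X}")      # U+041C
print(unicodedata.name(ch))    # CYRILLIC CAPITAL LETTER EM
print(chr(0x041C) == ch)       # True
```

The `unicodedata.name` call is also a handy way to find out what an unfamiliar character actually is.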

The Unicode Consortium itself does not invent new characters. Those icons that find their use in society are added to the tables. For example, the ruble sign was actively used for six years before it was added to Unicode. Emoji pictograms (emoticons) were also first widely used in Japan before they were included in the encoding. But trademarks and company logos are not added in principle. Even such common ones as the Apple apple or the Windows flag. To date, about 120 thousand characters are encoded in version 8.0.

Let's remember some facts we know:

The set of symbols with which text is written is called the alphabet.

The number of characters in an alphabet is its cardinality.

Formula for determining the amount of information: $N = 2^b$,

where N is the cardinality of the alphabet (number of characters),

b is the number of bits (information weight of a symbol).

An alphabet with a cardinality of 256 characters can accommodate almost all the necessary characters. Such an alphabet is called sufficient.

Since $256 = 2^8$, the weight of 1 character is 8 bits.

The unit of measurement 8 bits was given the name 1 byte:

1 byte = 8 bits.

The binary code of each character in computer text takes up 1 byte of memory.
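
The formula can be turned around to find the information weight of a symbol from the size of the alphabet. A minimal Python sketch, with the hypothetical helper name `bits_per_symbol`; it returns the smallest b for which two to the power b is not less than N.

```python
def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest b such that 2**b >= alphabet_size."""
    if alphabet_size < 1:
        raise ValueError("the alphabet must contain at least one symbol")
    # bit_length() of N - 1 is the number of bits needed to count 0..N-1.
    return (alphabet_size - 1).bit_length()


print(bits_per_symbol(256))    # 8, i.e. one byte
print(bits_per_symbol(128))    # 7
print(bits_per_symbol(65536))  # 16
```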

How is text information represented in computer memory?

Coding consists of assigning each character a unique decimal code from 0 to 255 or a corresponding binary code from 00000000 to 11111111. Thus, a person distinguishes characters by their outline, and a computer by their code.

The convenience of byte-by-byte character encoding is obvious because a byte is the smallest addressable part of memory and, therefore, the processor can access each character separately when processing text. On the other hand, 256 characters is quite a sufficient number to represent a wide variety of symbolic information.

Now the question arises, which eight-bit binary code to assign to each character.

It is clear that this is a conditional matter; you can come up with many encoding methods.

The ASCII table (American Standard Code for Information Interchange, read "aski") has become the international standard for PCs.

Only the first half of the table is the international standard, i.e. characters with numbers from 0 (00000000), to 127 (01111111).

Serial number    Code
0 - 31           00000000 - 00011111    Control characters. Their function is to control the process of displaying text on the screen or printing it, sounding a signal, marking up text, etc.
32 - 127         00100000 - 01111111
128 - 255        10000000 - 11111111

The second half of the ASCII code table, called the code page (128 codes, starting with 10000000 and ending with 11111111), can have different variants, each variant having its own number.

Please note that in the encoding table, letters (uppercase and lowercase) are arranged in alphabetical order, and numbers are ordered in ascending order. This observance of lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.

The most common encoding currently in use is the Microsoft Windows one, abbreviated CP1251.

Since the late 90s, the problem of standardizing character encoding has been solved by the introduction of a new international standard called Unicode. This is a 16-bit encoding, i.e. it allocates 2 bytes of memory for each character. Of course, this doubles the amount of memory occupied. But such a code table allows the inclusion of up to 65536 characters. The complete specification of the Unicode standard includes all the existing, extinct and artificially created alphabets of the world, as well as many mathematical, musical, chemical and other symbols.

Let's try using an ASCII table to imagine what words will look like in the computer's memory.
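
One way to picture this is to print the decimal and binary code of every letter. The sketch below is in Python; the helper `to_memory_view` and the sample word are my own choices and assume the word contains only ASCII characters.

```python
def to_memory_view(word: str) -> list:
    """Return the 8-bit binary code of every character of an ASCII word."""
    return [format(b, "08b") for b in word.encode("ascii")]


word = "file"
for ch, bits in zip(word, to_memory_view(word)):
    # One character occupies one byte, shown here as eight binary digits.
    print(ch, ord(ch), bits)
```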

When entering text information into a computer, characters (letters, numbers, signs) are encoded using various code systems, which consist of a set of code tables located on the corresponding pages of standards for encoding text information. In such tables, each character is assigned a specific numeric code in a hexadecimal or decimal number system, i.e., code tables reflect the correspondence between symbol images and numeric codes and are intended for encoding and decoding text information. When entering text information using a computer keyboard, each entered character is encoded, i.e., converted into a numeric code; when text information is output to a computer output device (display, printer or plotter), its image is constructed using the numeric code of the character. The assignment of a specific numeric code to a symbol is the result of an agreement between relevant organizations in different countries. Currently, there is no single universal code table that matches the letters of the national alphabets of different countries.

Modern code tables include international and national parts, i.e. they contain letters of the Latin and national alphabets, digits, arithmetic operation signs and punctuation marks, mathematical and control symbols, and pseudographic symbols. The international part of the code table, based on the ASCII standard (American Standard Code for Information Interchange), encodes the first half of the characters in the code table with numeric codes from 0 to 7F in hexadecimal, or from 0 to 127 in decimal. In this case, the codes from 0 to 20 in hexadecimal (0 to 32 in decimal) are assigned to the function keys (F1, F2, F3, etc.) of a personal computer keyboard. Fig. 3.1 shows the international part of the code tables based on the ASCII standard. The table cells are numbered in the decimal and hexadecimal number systems, respectively.

Figure 3.1. International part of the code table (standard ASCII) with cell numbers presented in decimal (a) and hexadecimal (b) number systems

The national part of code tables contains codes of national alphabets, which is also called a table of character sets (charset).

Currently, several code tables (encodings) support the letters of the Russian alphabet (Cyrillic) and are used by various operating systems. This is a significant drawback and in some cases leads to problems when decoding numeric character values. Table 3.1 lists the names of the code pages (standards) on which the Cyrillic code tables (encodings) are located.

Table 3.1

One of the first standards for encoding the Cyrillic alphabet on computers was the KOI8-R standard. The national part of the code table of this standard is shown in Fig. 3.2.

Fig. 3.2. National part of the code table of the KOI8-R standard

The code table located on page CP866 of the text information encoding standard, which is used in the MS DOS operating system or in an MS DOS session, is also currently used for encoding the Cyrillic alphabet (Fig. 3.3, a).

Fig. 3.3. The national part of the code table located on page CP866 (a) and on page CP1251 (b) of the text information encoding standard

Currently, the most widely used code table for encoding the Cyrillic alphabet is located on page CP1251 of the corresponding standard, which is used in Microsoft's Windows family of operating systems (Fig. 3.3, b). In all the code tables presented, except the Unicode standard table, 8 binary digits (8 bits) are allocated to encode one character.

At the end of the last century, a new international standard, Unicode, appeared, in which one character is represented as a two-byte binary code. The application of this standard continues the development of a universal international standard that solves the problem of compatibility of national character encodings. Using this standard you can encode $2^{16} = 65536$ different characters. Fig. 3.4 shows code table 0400 (Russian alphabet) of the Unicode standard.

Fig. 3.4. Unicode code table 0400

Let us explain what has been said regarding the encoding of text information using an example.

Example 3.1

Encode the word "Computer" as a sequence of decimal and hexadecimal numbers using the CP1251 encoding. What characters will be displayed in the CP866 and KOI8-R code tables when the resulting code is used?

The sequences of hexadecimal and binary codes of the word “Computer” based on the CP1251 encoding table (see Fig. 3.3, b) will look like this:

This code sequence in the CP866 and KOI8-R encodings will result in the display of the following characters:
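The procedure of Example 3.1 can also be reproduced programmatically. The following is only a minimal Python sketch: it assumes that the word in the Russian original is «Компьютер», and the helper names `encode_word` and `reinterpret` are hypothetical, introduced here for illustration. The codec names `cp1251`, `cp866` and `koi8_r` are the ones used by the Python standard library.

```python
# Assumption: the word in the Russian original is "Компьютер" ("Computer").
WORD = "Компьютер"

def encode_word(word: str, encoding: str = "cp1251") -> bytes:
    """Return the byte sequence of `word` in the given 8-bit encoding."""
    return word.encode(encoding)

def reinterpret(data: bytes, encoding: str) -> str:
    """Read the same bytes as if they had been written in another encoding."""
    return data.decode(encoding)

codes = encode_word(WORD)
print("decimal:    ", " ".join(str(b) for b in codes))
print("hexadecimal:", " ".join(f"{b:02X}" for b in codes))
print("binary:     ", " ".join(f"{b:08b}" for b in codes))
# The same bytes looked up in the other two code tables:
print("as CP866:   ", reinterpret(codes, "cp866"))
print("as KOI8-R:  ", reinterpret(codes, "koi8_r"))
```

The bytes never change between the last two lines; only the code table used to look them up does, which is exactly why the displayed characters differ.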

To convert Russian-language text documents from one text information encoding standard to another, special programs called converters are used. Converters are usually built into other programs. An example is the Internet Explorer (IE) browser, which has a built-in converter. A browser is a special program for viewing the content of Web pages on the global computer network, the Internet. Let's use this program to confirm the symbol mapping result obtained in example 3.1. To do this, we will perform the following steps.

1. Launch the Notepad program (NotePad). In the Windows XP operating system, Notepad is launched using the command: [Start button – Programs – Accessories – Notepad]. In the Notepad window that opens, type the word “Computer” using the syntax of the hypertext document markup language HTML (Hyper Text Markup Language). This language is used to create documents on the Internet. The text should look like this:


, where the tags surrounding the word are special constructs of the HTML language for heading markup. Fig. 3.5 shows the result of these actions.

Fig. 3.5. Displaying text in the Notepad window

Let's save this text in the appropriate folder on the computer by executing the command [File – Save as...]; when saving, we will give the file the name Prim with the file extension .html.

2. Let's launch Internet Explorer by executing the command: [Start button – Programs – Internet Explorer]. When the program starts, the window shown in Fig. 3.6 will appear.

Fig. 3.6. Offline access window

Select and activate the Offline button; this will prevent the computer from connecting to the global network, the Internet. The main Microsoft Internet Explorer window will appear, as shown in Fig. 3.7.

Fig. 3.7. Microsoft Internet Explorer main window

Let's execute the command [File – Open]. A window will appear (Fig. 3.8) in which you need to specify the file name and click the OK button, or click the Browse… button and find the file Prim.html.

Fig. 3.8. Open window

The main Internet Explorer window will take the form shown in Fig. 3.9. The word “Computer” will appear in the window. Next, using the top menu of Internet Explorer, run the command [View – Encoding – Cyrillic (DOS)]. After this command is executed, the Internet Explorer window will display the symbols shown in Fig. 3.10. When the command [View – Encoding – Cyrillic (KOI8-R)] is executed, the Internet Explorer window will display the symbols shown in Fig. 3.11.

Fig. 3.9. Characters displayed with CP1251 encoding

Fig. 3.10. Characters displayed when CP866 encoding is enabled for a code sequence represented in CP1251 encoding

Fig. 3.11. Characters displayed when KOI8-R encoding is enabled for a code sequence represented in CP1251 encoding

Thus, the character sequences obtained using Internet Explorer coincide with the character sequences obtained using the CP866 and KOI8-R code tables in example 3.1.

3.2. Encoding graphic information

Graphic information presented in the form of pictures, photographs, slides, moving images (animation, video), diagrams and drawings can be created and edited using a computer, and it is encoded accordingly. Currently, there are quite a large number of application programs for processing graphic information, but they all implement three types of computer graphics: raster, vector and fractal.

If you look more closely at a graphic image on a computer monitor screen, you can see a large number of multi-colored dots (pixels, from the English pixel, formed from picture element), which, taken together, form the graphic image. From this we can conclude that a graphic image on a computer is encoded in a certain way and must be presented in the form of a graphic file. A file is the basic structural unit for organizing and storing data on a computer and, in this case, must contain information on how to present this set of points on the monitor screen.

Files created with vector graphics contain information in the form of mathematical dependencies (mathematical functions that describe linear dependencies) and the corresponding data on how to construct an image of an object from line segments (vectors) when displaying it on a computer monitor.

Files created with raster graphics store data about each individual point in the image. Displaying raster graphics does not require complex mathematical calculations; it is enough to obtain the data for each point of the image (its coordinates and color) and display them on the computer monitor screen.

During the encoding process, an image is spatially discretized, i.e. the image is divided into individual points and each point is given a color code (yellow, red, blue, etc.). To encode each point of a color graphic image, an arbitrary color is decomposed into its main components, for which three primary colors are used: red (Red, denoted by the letter R), green (Green, denoted by the letter G) and blue (Blue, denoted by the letter B). Any color of a dot perceived by the human eye can be obtained by additive (proportional) mixing of the three primary colors: red, green and blue. This coding system is called the RGB color system. Graphic image files that use the RGB color system represent each point of the image as a color triplet: three numerical values R, G and B corresponding to the intensities of the red, green and blue colors. A graphic image is encoded using various technical means (scanner, digital camera, digital video camera, etc.); the result is a raster image. When color graphic images are reproduced on a color computer monitor, the color of each point (pixel) is obtained by mixing the three primary colors R, G and B.

The quality of a raster image is determined by two main parameters: resolution (the number of pixels horizontally and vertically) and the color palette used (the number of colors that can be specified for each pixel in the image). Resolution is specified by indicating the number of pixels horizontally and vertically, for example 800 by 600 pixels.

There is a relationship between the number of colors assigned to a point in a raster image and the amount of information that must be allocated to store the color of the point, determined by the relationship (R. Hartley’s formula):

$$I = \log_2 N, \qquad (3.1)$$

where I is the amount of information and N is the number of colors assigned to the point.

The amount of information required to store the color of a point is also called color depth, or color rendering quality.

So, if the number of colors specified for an image point is N = 256, then the amount of information required to store it (color depth) in accordance with formula (3.1) will be I = 8 bits.
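The relationship between palette size and color depth can be checked with a few lines of Python. This is a minimal sketch; the function names `color_depth` and `palette_size` are hypothetical, and the depth is rounded up to a whole number of bits for palettes whose size is not a power of two.

```python
def color_depth(num_colors: int) -> int:
    """Bits needed per point for a palette of `num_colors` colors.

    Implements I = log2(N), rounded up to a whole number of bits.
    """
    if num_colors < 1:
        raise ValueError("a palette needs at least one color")
    # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1, without float error.
    return (num_colors - 1).bit_length()

def palette_size(depth_bits: int) -> int:
    """Number of distinct colors available at a given color depth (N = 2**I)."""
    return 2 ** depth_bits

for n in (2, 16, 256, 65536):
    print(n, "colors ->", color_depth(n), "bits")
```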

Computers use various graphics modes of monitor operation to display graphic information. It should be noted that, in addition to the graphics mode, there is also a text mode, in which the monitor screen is conventionally divided into 25 lines of 80 characters each. Graphics modes are characterized by the monitor's screen resolution and color quality (color depth). To set the graphics mode of the monitor screen in the MS Windows XP operating system, execute the command: [Start button – Settings – Control Panel – Display]. In the “Display Properties” dialog box that appears (Fig. 3.12), select the “Settings” tab and use the “Screen resolution” slider to select the appropriate screen resolution (800 by 600 pixels, 1024 by 768 pixels, etc.). Using the “Color quality” drop-down list, you can select the color depth, “Highest (32 bits)”, “Medium (16 bits)”, etc.; the number of colors assigned to each point in the image will then be $2^{32}$ (4294967296), $2^{16}$ (65536), etc., respectively.

Fig. 3.12. Display Properties dialog box

To implement each of the graphics modes of the monitor screen, a certain amount of computer video memory is required. The required information volume of video memory (V) is determined from the relation

$$V = K \cdot I, \qquad (3.2)$$

where K is the number of image points on the monitor screen (K = A · B); A is the number of horizontal dots on the monitor screen; B is the number of vertical dots on the monitor screen; I is the amount of information (color depth).

So, if the monitor screen has a resolution of 1024 by 768 pixels and a palette consisting of 65,536 colors, then the color depth in accordance with formula (3.1) will be $I = \log_2 65536 = 16$ bits, the number of image pixels will be K = 1024 · 768 = 786432, and the required information volume of video memory in accordance with (3.2) will be

V = 786432 · 16 bits = 12582912 bits = 1572864 bytes = 1536 KB = 1.5 MB.
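The same calculation is easy to automate. The sketch below is illustrative only: the function name `video_memory_bits` is hypothetical, and the units follow the convention of the text (1 KB = 1024 bytes, 1 MB = 1024 KB).

```python
def video_memory_bits(width: int, height: int, depth_bits: int) -> int:
    """Video memory in bits for one screen: V = K * I, where K = width * height."""
    return width * height * depth_bits

bits = video_memory_bits(1024, 768, 16)
size_bytes = bits // 8
size_kb = size_bytes / 1024
size_mb = size_kb / 1024
print(bits, "bits =", size_bytes, "bytes =", size_kb, "KB =", size_mb, "MB")
```

Changing the three arguments gives the requirement for any other graphics mode.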

In conclusion, it should be noted that in addition to the listed characteristics, the most important characteristics of a monitor are the geometric dimensions of its screen and image points. The geometric dimensions of the screen are determined by the diagonal size of the monitor. The diagonal size of monitors is specified in inches (1 inch = 1" = 25.4 mm) and can take values equal to 14", 15", 17", 21", etc. Modern monitor production technologies can provide an image point size equal to 0.22 mm.

Thus, for each monitor there is a physically maximum possible screen resolution, determined by the size of its diagonal and the size of the image point.
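A rough estimate of this physical limit can be sketched as follows. The text does not state the screen proportions, so the 4:3 aspect ratio used here is an assumption, as is treating the nominal diagonal as the full visible diagonal; the function name `max_resolution` is hypothetical.

```python
import math

MM_PER_INCH = 25.4

def max_resolution(diagonal_inches: float, dot_mm: float, aspect=(4, 3)):
    """Estimate the largest (horizontal, vertical) dot count that fits the screen.

    Assumptions: the whole diagonal is visible and the screen has the given
    aspect ratio (4:3 by default).
    """
    ax, ay = aspect
    diagonal_mm = diagonal_inches * MM_PER_INCH
    unit = diagonal_mm / math.hypot(ax, ay)   # length of one "aspect unit" in mm
    width_mm, height_mm = ax * unit, ay * unit
    return int(width_mm // dot_mm), int(height_mm // dot_mm)

print(max_resolution(17, 0.22))
```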

Exercises to do on your own

1. Using the MS Excel program, convert the ASCII, CP866, CP1251 and KOI8-R code tables into tables of the following form: in the cells of the first column, write in alphabetical order the uppercase and then the lowercase letters of the Latin and Cyrillic alphabets; in the cells of the second column, the codes corresponding to the letters in the decimal number system; in the cells of the third column, the codes corresponding to the letters in the hexadecimal number system. Code values must be selected from the corresponding code tables.

2. Encode and write down the following words as a sequence of numbers in the decimal and hexadecimal number systems:

a) Internet Explorer; b) Microsoft Office; c) CorelDRAW.

Encoding is carried out using the modernized ASCII encoding table obtained in the previous exercise.

3. Using the modernized KOI8-R encoding table, decode sequences of numbers written in the hexadecimal number system:

a) FC CB DA C9 D3 D4 C5 CE C3 C9 D1;


c) FC CB D3 D0 D2 C5 D3 C9 CF CE C9 DA CD.

4. What will the word “Cybernetics”, written in the CP1251 encoding, look like when the CP866 and KOI8-R encodings are used? Check the results using Internet Explorer.

5. Using the code table shown in Fig. 3.1, a, decode the following code sequences written in the binary number system:

a) 01010111 01101111 01110010 01100100;

b) 01000101 01111000 01100011 01100101 01101100;

c) 01000001 01100011 01100011 01100101 01110011 01110011.

6. Determine the information volume of the word “Economy”, encoded using the CP866, CP1251, Unicode and KOI8-R code tables.

7. Determine the information volume of the file obtained as a result of scanning a color image measuring 12x12 cm. The resolution of the scanner used to scan this image is 600 dpi. The scanner sets the color depth of the image point to 16 bits.

A scanner resolution of 600 dpi (dots per inch) means that the scanner can distinguish 600 dots on a 1-inch segment.

8. Determine the information volume of the file obtained as a result of scanning a color image of A4 size. The resolution of the scanner used to scan this image is 1200 dpi. The scanner sets the color depth of the image point to 24 bits.

9. Determine the number of colors in the palette at color depths of 8, 16, 24 and 32 bits.

10. Determine the required amount of video memory for the monitor graphics modes 640 by 480, 800 by 600, 1024 by 768 and 1280 by 1024 pixels with an image pixel color depth of 8, 16, 24 and 32 bits. Summarize the results in a table. Develop a program in MS Excel to automate the calculations.

11. Determine the maximum number of colors that can be used to store an image measuring 32 by 32 pixels, if the computer has 2 KB of memory allocated for the image.

12. Determine the maximum possible resolution of a monitor screen with a diagonal length of 15" and an image point size of 0.28 mm.

13. What graphic modes of the monitor can be provided by 64 MB of video memory?


I. History of information coding

Humanity has been using text encryption (encoding) since the very moment when the first secret information appeared. Here are several text encoding techniques that were invented at various stages of the development of human thought:

Cryptography is secret writing, a system of changing writing in order to make the text incomprehensible to the uninitiated;

Morse code or uneven telegraph code, in which each letter or sign is represented by its own combination of short elementary bursts of electric current (dots) and elementary bursts of triple duration (dash);

Sign language is a language of gestures used by people with hearing impairments.

One of the earliest known encryption methods is named after the Roman emperor Julius Caesar (1st century BC). This method is based on replacing each letter of the text being encrypted with another letter, obtained by shifting through the alphabet from the original letter by a fixed number of positions; the alphabet is read in a circle, that is, after the last letter (я in the Russian alphabet) comes the first (а). Thus the Russian word «байт» (“byte”), when shifted two characters to the right, is encoded as «гвлф». To decrypt this word, each encrypted letter must be replaced with the second letter to its left.
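The shift described above can be sketched in a few lines of Python. This is only an illustration: it assumes the example word in the Russian original is «байт» ("byte") and uses the 33-letter lowercase Russian alphabet; the function name `caesar` is hypothetical. Characters outside the alphabet are left unchanged.

```python
RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def caesar(text: str, shift: int, alphabet: str = RUSSIAN) -> str:
    """Shift every letter of `text` by `shift` positions, wrapping around the alphabet."""
    n = len(alphabet)
    result = []
    for ch in text:
        pos = alphabet.find(ch)
        # Letters found in the alphabet are shifted in a circle; others are kept as-is.
        result.append(alphabet[(pos + shift) % n] if pos >= 0 else ch)
    return "".join(result)

encrypted = caesar("байт", 2)
print(encrypted)               # shift two positions to the right
print(caesar(encrypted, -2))   # shift two positions back to decrypt
```

Decryption is the same operation with the opposite shift, which is why one function is enough.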

II. Encoding information

A code is a set of symbols (or signals) used to record (or convey) some predefined concepts.

Information coding is the process of forming a specific representation of information. In a narrower sense, the term “coding” is often understood as a transition from one form of information representation to another, more convenient for storage, transmission or processing.

Usually, each image when encoding (sometimes called encryption) is represented by a separate sign.

A sign is an element of a finite set of elements distinct from each other.


You can process text information on a computer. When entered into a computer, each letter is encoded with a specific number, and when output to external devices (screen or printer) for human perception, images of letters are built from these numbers. The correspondence between a set of letters and numbers is called a character encoding.

As a rule, all numbers in a computer are represented using zeros and ones (not ten digits, as is usual for people). In other words, computers usually operate in the binary number system, since this makes the devices for processing them much simpler. Entering numbers into a computer and outputting them for human reading can be done in the usual decimal form, and all necessary conversions are performed by programs running on the computer.

III. Encoding text information

The same information can be presented (encoded) in several forms. With the advent of computers, the need arose to encode all types of information that both an individual and humanity as a whole deal with. But humanity began to solve the problem of encoding information long before the advent of computers. The grandiose achievements of mankind - writing and arithmetic - are nothing more than a system for encoding speech and numerical information. Information never appears in its pure form, it is always presented somehow, encoded somehow.

Binary coding is one of the common ways of representing information. In computers, robots and CNC machine tools, as a rule, all the information the device deals with is encoded as words of the binary alphabet.

Since the late 1960s, computers have increasingly been used for processing text information, and currently the bulk of the world's personal computers are occupied (most of the time) with processing text information. All these types of information are represented in a computer in binary code, that is, an alphabet of power two is used (only two characters, 0 and 1). This is because it is convenient to represent information as a sequence of electrical impulses: no impulse (0), impulse present (1).

Such coding is usually called binary, and the logical sequences of zeros and ones themselves are called machine language.

From a computer point of view, text consists of individual characters. The symbols include not only letters (uppercase or lowercase, Latin or Russian), but also numbers, punctuation marks, special characters such as "=", "(", "&", etc., and even (pay special attention!) spaces between words.

Texts are entered into the computer's memory using the keyboard. The letters, numbers, punctuation marks and other symbols we are familiar with are written on the keys. They enter RAM in binary code. This means that each character is represented by an 8-bit binary code.

Traditionally, an amount of information equal to 1 byte is used to encode one character, i.e. I = 1 byte = 8 bits. Using the formula that connects the number of possible events K and the amount of information I, you can calculate how many different symbols can be encoded (assuming that symbols are possible events): $K = 2^I = 2^8 = 256$, i.e. an alphabet with a capacity of 256 characters can be used to represent text information.

This number of characters is quite sufficient to represent text information, including uppercase and lowercase letters of the Russian and Latin alphabets, numbers, signs, graphic symbols, etc.

Coding consists of assigning each character a unique decimal code from 0 to 255 or the corresponding binary code from 00000000 to 11111111. Thus, a person distinguishes characters by their outline, and a computer by their code.

The convenience of byte-by-byte character encoding is obvious because a byte is the smallest addressable part of memory and, therefore, the processor can access each character separately when processing text. On the other hand, 256 characters is quite a sufficient number to represent a wide variety of symbolic information.

In the process of displaying a symbol on a computer screen, the reverse process is performed - decoding, that is, converting the symbol code into its image. It is important that assigning a specific code to a symbol is a matter of agreement, which is recorded in the code table.

Now the question arises, which eight-bit binary code to assign to each character. It is clear that this is a conditional matter; you can come up with many encoding methods.

All characters of the computer alphabet are numbered from 0 to 255. Each number corresponds to an eight-bit binary code from 00000000 to 11111111. This code is simply the serial number of the character in the binary number system.

IV. Types of encoding tables

A table in which all the characters of the computer alphabet are assigned serial numbers is called an encoding table.

Different types of computers use different encoding tables.

The ASCII code table (American Standard Code for Information Interchange) has been adopted as an international standard; it encodes the first half of the characters with numeric codes from 0 to 127 (codes from 0 to 31 are assigned not to printable characters but to control functions).

The ASCII code table is divided into two parts.

Only the first half of the table is the international standard, i.e. characters with numbers from 0 (00000000), to 127 (01111111).

ASCII encoding table structure

| Serial number | Code | Symbol |
|---|---|---|
| 0 - 31 | 00000000 - 00011111 | Symbols with numbers from 0 to 31 are usually called control symbols. Their function is to control the process of displaying text on the screen or printing, sounding a sound signal, marking up text, etc. |
| 32 - 127 | 00100000 - 01111111 | Standard part of the table (English). This includes lowercase and uppercase letters of the Latin alphabet, decimal digits, punctuation marks, all kinds of brackets, commercial and other symbols. Character 32 is a space, i.e. an empty position in the text. All others are represented by certain signs. |
| 128 - 255 | 10000000 - 11111111 | Alternative part of the table (Russian). |

The second half of the ASCII code table, called the code page (128 codes, starting with 10000000 and ending with 11111111), may have various options, each option has its own number.

The code page is primarily used to accommodate national alphabets other than Latin. In Russian national encodings, characters from the Russian alphabet are placed in this part of the table.

First half of the ASCII code table

Please note that in the encoding table, letters (uppercase and lowercase) are arranged in alphabetical order, and numbers are ordered in ascending order. This observance of lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.

For letters of the Russian alphabet, the principle of sequential coding is also observed.
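The principle of sequential coding can be checked directly. The sketch below is illustrative, with a hypothetical helper `is_sequential`; it tests whether character codes grow in alphabetical order. The letter ё is left out because 8-bit Cyrillic tables usually place it apart from the main block. Note that the check is expected to fail for KOI8-R, whose Cyrillic letters follow the order of their Latin counterparts rather than the Russian alphabet.

```python
LATIN_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"
# Russian lowercase letters without "ё" (it is usually stored outside the main block).
RUSSIAN_LOWER = "абвгдежзийклмнопрстуфхцчшщъыьэюя"

def is_sequential(chars: str, encoding: str) -> bool:
    """True if the codes of `chars` strictly increase in the given 8-bit encoding."""
    codes = [ch.encode(encoding)[0] for ch in chars]
    return all(a < b for a, b in zip(codes, codes[1:]))

print("ASCII letters:", is_sequential(LATIN_UPPER, "ascii"))
print("ASCII digits: ", is_sequential(DIGITS, "ascii"))
for name in ("cp1251", "cp866", "koi8_r"):
    print(name, is_sequential(RUSSIAN_LOWER, name))
```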

Second half of the ASCII code table

Unfortunately, there are currently five different Cyrillic encodings (KOI8-R, Windows, MS-DOS, Macintosh and ISO). Because of this, problems often arise when transferring Russian text from one computer to another or from one software system to another.

Chronologically, one of the first standards for encoding Russian letters on computers was KOI8 ("Information Exchange Code, 8-bit"). This encoding was used back in the 1970s on computers of the ES series, and from the mid-1980s it began to be used in the first Russified versions of the UNIX operating system.

The CP866 encoding ("CP" means "Code Page") remains from the early 1990s, the time when the MS DOS operating system dominated.

Apple computers running the Mac OS operating system use their own Mac encoding.

In addition, the International Organization for Standardization (ISO) has approved another encoding, ISO 8859-5, as a standard for the Russian language.

The most common encoding currently in use is Microsoft Windows, abbreviated CP1251. It was introduced by Microsoft; given the widespread use of this company's operating systems (OS) and other software products, it has become widely used in the Russian Federation.

Since the late 90s, the problem of standardizing character encoding has been solved by the introduction of a new international standard called Unicode.

This is a 16-bit encoding, i.e. it allocates 2 bytes of memory for each character. Of course, this increases the amount of memory occupied by 2 times. But such a code table allows the inclusion of up to 65536 characters. The complete specification of the Unicode standard includes all the existing, extinct and artificially created alphabets of the world, as well as many mathematical, musical, chemical and other symbols.

Internal representation of words in computer memory using an ASCII table

Sometimes a text consisting of letters of the Russian alphabet received from another computer cannot be read: some kind of “abracadabra” appears on the monitor screen. This happens because the computers use different encodings of Russian characters.

Thus, each encoding is specified by its own code table. As can be seen from the table, different characters are assigned to the same binary code in different encodings.

For example, the sequence of numeric codes 221, 194, 204 in the CP1251 encoding forms the word «ЭВМ» (the Russian abbreviation for “computer”), whereas in other encodings it will be a meaningless set of characters.

Fortunately, in most cases the user does not have to worry about transcoding text documents, since this is done by special converter programs built into applications.
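What such a converter does can be sketched in a couple of lines. This is not the implementation of any particular application, only a minimal illustration with a hypothetical function `transcode`: the bytes are decoded with the source code table and encoded again with the target one. The sample bytes are the code sequence 221, 194, 204 mentioned above.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode text bytes from the `source` code table to the `target` one."""
    return data.decode(source).encode(target)

original = bytes([221, 194, 204])        # code sequence from the text, in CP1251
print(original.decode("cp1251"))         # read with the correct table
print(original.decode("koi8_r"))         # same bytes read with the wrong table

converted = transcode(original, "cp1251", "koi8_r")
print(list(converted), converted.decode("koi8_r"))
```

After conversion the numeric codes are different, but the text read through the KOI8-R table is the same as the original read through CP1251.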

V. Calculation of the amount of text information

Task 1: Encode the word “Rome” using the KOI8-R and CP1251 encoding tables.


Task 2: Assuming that each character is encoded in one byte, estimate the information volume of the following sentence:

“My uncle has the most honest rules,

When I seriously fell ill,

He forced himself to respect

And I couldn’t think of anything better.”

Solution: This phrase has 108 characters, including punctuation, quotation marks and spaces. We multiply this number by 8 bits. We get 108*8=864 bits.

Task 3: The two texts contain the same number of characters. The first text is written in Russian, and the second in the language of the Naguri tribe, whose alphabet consists of 16 characters. Whose text contains more information?


1) I = K * a (the information volume of the text is equal to the product of the number of characters and the information weight of one character).

2) Since both texts have the same number of characters (K), the difference depends on the information weight of one character of the alphabet (a).

3) $2^{a_1} = 32$, i.e. $a_1 = 5$ bits; $2^{a_2} = 16$, i.e. $a_2 = 4$ bits.

4) $I_1 = K * 5$ bits, $I_2 = K * 4$ bits.

5) This means that the text written in Russian carries 5/4 times more information.

Task 4: The size of the message, containing 2048 characters, was 1/512 of a MB. Determine the power of the alphabet.


1) I = 1/512 * 1024 * 1024 * 8 = 16384 bits: the information volume of the message converted into bits.

2) a = I / K = 16384 / 2048 = 8 bits per character of the alphabet.

3) $2^8 = 256$ characters: the power of the alphabet used.
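Task 4 can be recomputed directly from its statement (2048 characters, 1/512 of a MB, with 1 MB taken as 1024 * 1024 bytes). The following is a small sketch; the function name `alphabet_power` is hypothetical.

```python
def alphabet_power(message_bytes: int, num_chars: int):
    """Return (bits per character, alphabet size) for a message of known volume."""
    total_bits = message_bytes * 8
    bits_per_char, remainder = divmod(total_bits, num_chars)
    if remainder:
        raise ValueError("the volume is not a whole number of bits per character")
    return bits_per_char, 2 ** bits_per_char

message_bytes = 1024 * 1024 // 512            # 1/512 of a MB, in bytes
bits_per_char, power = alphabet_power(message_bytes, 2048)
print(message_bytes * 8, "bits in total")
print(bits_per_char, "bits per character,", power, "characters in the alphabet")
```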

Task 5: A Canon LBP laser printer prints at an average speed of 6.3 Kbit/s. How long will it take to print an 8-page document if one page has an average of 45 lines of 70 characters each (1 character = 1 byte)?


1) Find the amount of information contained on 1 page: 45 * 70 * 8 bits = 25200 bits.

2) Find the amount of information on 8 pages: 25200 * 8 = 201600 bits.

3) Reduce to common units of measurement by converting Kbit/s into bit/s: 6.3 * 1024 = 6451.2 bit/s.

4) Find the printing time: 201600 / 6451.2 = 31.25, i.e. about 31 seconds.
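The four steps of this solution fit into one short function. This is a sketch only; the name `print_time_seconds` is hypothetical, and 1 Kbit is taken as 1024 bits, as in the solution.

```python
def print_time_seconds(pages: int, lines_per_page: int, chars_per_line: int,
                       speed_kbit_s: float, bits_per_char: int = 8) -> float:
    """Time needed to transfer the whole document at the given speed."""
    total_bits = pages * lines_per_page * chars_per_line * bits_per_char
    speed_bit_s = speed_kbit_s * 1024   # convention of the text: 1 Kbit = 1024 bits
    return total_bits / speed_bit_s

seconds = print_time_seconds(pages=8, lines_per_page=45, chars_per_line=70,
                             speed_kbit_s=6.3)
print(round(seconds, 2), "seconds")
```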


Material for self-study on the topic of Lecture 2

Encoding ASCII

ASCII encoding table (ASCII: American Standard Code for Information Interchange).

In total, 256 different characters can be encoded using the ASCII encoding table (Figure 1). This table is divided into two parts: the main one (with codes from 00h to 7Fh) and the additional one (from 80h to FFh, where the letter h indicates that the code belongs to the hexadecimal number system).

Figure 1

To encode one character from the table, 8 bits (1 byte) are allocated. When processing text information, one byte may contain the code of a certain character - a letter, number, punctuation mark, action sign, etc. Each character has its own code in the form of an integer. In this case, all codes are collected in special tables called coding tables. With their help, the symbol code is converted into its visible representation on the monitor screen. As a result, any text in computer memory is represented as a sequence of bytes with character codes.

For example, the word hello! will be coded as follows (Table 1).

Table 1

Binary code

Decimal code
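The rows of Table 1 did not survive in this text, but a table of this kind can be regenerated. The sketch below assumes the word is the Latin-letter string hello! encoded with the standard part of the ASCII table; the function name `code_table` is hypothetical.

```python
def code_table(word: str, encoding: str = "ascii"):
    """Return (character, binary code, decimal code) for every character of `word`.

    Assumes a single-byte encoding, so each character maps to exactly one byte.
    """
    return [(ch, f"{code:08b}", code) for ch, code in zip(word, word.encode(encoding))]

print("Symbol  Binary code  Decimal code")
for ch, binary, decimal in code_table("hello!"):
    print(f"{ch:<7} {binary:<12} {decimal}")
```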

Figure 1 shows the characters included in the standard (English) and extended (Russian) ASCII encoding.

The first half of the ASCII table is standardized. It contains control codes (from 00h to 20h and 7Fh). These codes have been removed from the table because they do not apply to text elements. Punctuation marks and mathematical symbols are also placed here: 21h - !, 26h - &, 28h - (, 2Bh - +, ..., as well as uppercase and lowercase Latin letters: 41h - A, 61h - a.

The second half of the table contains national fonts, pseudographic symbols from which tables can be constructed, and special mathematical symbols. The lower part of the encoding table can be replaced using appropriate drivers - control auxiliary programs. This technique allows you to use several fonts and their typefaces.

For each symbol code, the display should show an image of the symbol: not just a digital code but the corresponding picture, since each symbol has its own shape. A description of the shape of each character is stored in a special display memory called a character generator. On the screen of an IBM PC display, for example, a character is rendered using dots that form a character matrix. Each pixel in such a matrix is an image element and can be bright or dark. A dark dot is coded as 0, a light (bright) dot as 1. If you represent the dark pixels in the matrix field of a character as dots and the light pixels as asterisks, you can graphically depict the shape of the symbol.
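The dot-and-asterisk picture described above is easy to produce. In the sketch below the 8 by 8 bitmap for the letter A is hypothetical, drawn by hand for illustration; it is not the data of any real character generator, and the function name `render` is also an assumption.

```python
# Hypothetical 8x8 character matrix for the letter "A".
# Each integer is one row; the most significant bit is the leftmost pixel.
GLYPH_A = [
    0b00011000,
    0b00100100,
    0b01000010,
    0b01000010,
    0b01111110,
    0b01000010,
    0b01000010,
    0b00000000,
]

def render(glyph, width: int = 8):
    """Turn rows of bits into text lines: 0 (dark) becomes '.', 1 (bright) becomes '*'."""
    lines = []
    for row in glyph:
        bits = f"{row:0{width}b}"
        lines.append(bits.replace("0", ".").replace("1", "*"))
    return lines

print("\n".join(render(GLYPH_A)))
```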

People in different countries use characters to write words in their native languages. These days, most applications, including email systems and web browsers, are purely 8-bit, meaning they can display and correctly accept only 8-bit characters, according to the ISO-8859-1 standard.

There are far more than 256 characters in the world (if you take into account Cyrillic, Arabic, Chinese, Japanese, Korean and Thai), and new characters keep appearing. This creates the following problems for many users:

It is not possible to use characters from different encoding sets in the same document. Since each text document uses its own encoding, automatic recognition of the text is very difficult.

New symbols appear (for example, the Euro sign), and as a result ISO developed a new standard, ISO-8859-15, which is very similar to ISO-8859-1. The difference is that a few rarely used symbols of the old ISO-8859-1 table, such as the generic currency sign, were removed to make room for newly introduced symbols (such as the Euro sign). As a result, users may have the same documents on their disks, but in different encodings. The solution to these problems is the adoption of a single international character set called the Universal Coding, or Unicode.
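Both problems can be reproduced in a short Python sketch. The sample string is invented for the example; the codec names are the ones Python ships with.

```python
mixed = "Привет, κόσμε"  # Russian and Greek in one string

# An 8-bit code page has room for only one national alphabet.
try:
    mixed.encode("cp1251")
    fits_in_cp1251 = True
except UnicodeEncodeError:
    fits_in_cp1251 = False

# A Unicode encoding such as UTF-8 stores both scripts together.
utf8_bytes = mixed.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The same byte means different things in ISO-8859-1 and ISO-8859-15.
byte_a4 = bytes([0xA4])
in_latin1 = byte_a4.decode("iso8859_1")    # generic currency sign
in_latin9 = byte_a4.decode("iso8859_15")   # Euro sign

print(fits_in_cp1251, round_trip == mixed, in_latin1, in_latin9)
```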

Unicode encoding

The standard was proposed in 1991 by the non-profit Unicode Consortium (Unicode Inc.). This standard makes it possible to encode a very large number of characters from different scripts: Unicode documents can contain Chinese characters, mathematical symbols, and letters of the Greek, Latin and Cyrillic alphabets side by side, and switching code pages becomes unnecessary.

The standard consists of two main sections: the universal character set (UCS) and the encoding family (UTF, Unicode transformation format). The universal character set specifies a one-to-one correspondence between characters and codes - elements of the code space representing non-negative integers. An encoding family defines the machine representation of a sequence of UCS codes.
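The two parts are easy to tell apart in Python, where `ord` and `chr` work at the character set level and `str.encode` produces the machine representation. This is only an illustration with one sample letter.

```python
char = "Я"

# UCS part: the character corresponds to one non-negative integer code.
code = ord(char)
print(f"U+{code:04X}")          # U+042F
assert chr(code) == char        # the mapping works in both directions

# UTF part: the same code has different machine representations.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, char.encode(encoding).hex(" "))
```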

The Unicode standard was developed to create a single character encoding for all modern and many ancient written languages. In the original versions of the standard, each character is encoded with 16 bits, which allows it to cover an incomparably larger number of characters than the previously accepted 8-bit encodings. Another important difference between Unicode and other encoding systems is that it not only assigns a unique code to each character, but also defines various characteristics of that character, for example:

    character type (uppercase letter, lowercase letter, number, punctuation mark, etc.);

    character attributes (display from left to right or right to left, space, line break, etc.);

    corresponding capital or lowercase letter (for lowercase and capital letters respectively);

    appropriate numeric value (for digit characters).
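These characteristics can be looked up with Python's standard `unicodedata` module. The sample characters below are arbitrary picks for illustration.

```python
import unicodedata

for char in ["A", "я", "5", "½", "א", ","]:
    print(
        repr(char),
        unicodedata.name(char),
        unicodedata.category(char),        # character type, e.g. Lu = uppercase letter
        unicodedata.bidirectional(char),   # display direction class, e.g. L or R
        unicodedata.numeric(char, None),   # numeric value, or None if there is none
    )

# Corresponding capital or lowercase letter.
print("я".upper(), "Я".lower())
```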

The entire range of codes from 0 to FFFF is divided into several standard subsets, each of which corresponds either to the alphabet of a language or to a group of special characters that are similar in their functions. The diagram below contains a general list of Unicode 3.0 subsets (Figure 2).

Figure 2

The Unicode standard is the basis for storing text in many modern computer systems. However, its 16-bit codes are not compatible with most Internet protocols, because they can contain any byte values, while protocols typically use bytes 00 - 1F and FE - FF as service bytes. To achieve compatibility, several Unicode Transformation Formats (UTFs) have been developed, of which UTF-8 is by far the most common. This format defines the following rules for converting each Unicode code into a sequence of bytes (one to three) suitable for transport by Internet protocols:

Code range (hex)    Bytes (binary)
0000 - 007F         0zzzzzzz
0080 - 07FF         110yyyyy 10zzzzzz
0800 - FFFF         1110xxxx 10yyyyyy 10zzzzzz

Here x, y, z denote the bits of the source code that should be extracted, starting with the least significant one, and entered into the result bytes from right to left until all specified positions are filled.
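The bit shuffling is easier to follow in code. Below is a minimal sketch of the one-to-three-byte rules for codes 0000 - FFFF; the function name `utf8_encode_bmp` is invented here, and the sketch deliberately ignores special cases such as the reserved range D800 - DFFF. In real programs you would simply call `str.encode("utf-8")`.

```python
def utf8_encode_bmp(code: int) -> bytes:
    """Encode one Unicode code in the range 0000-FFFF as UTF-8 bytes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("this sketch only handles codes 0000-FFFF")
    if code <= 0x7F:
        # 0zzzzzzz: plain ASCII, one byte
        return bytes([code])
    if code <= 0x7FF:
        # 110yyyyy 10zzzzzz: upper 5 bits, then lower 6 bits
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 1110xxxx 10yyyyyy 10zzzzzz: upper 4 bits, middle 6 bits, lower 6 bits
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])

for char in "AЯ€":
    encoded = utf8_encode_bmp(ord(char))
    # Compare the hand-made result with Python's own encoder.
    print(char, encoded.hex(" "), encoded == char.encode("utf-8"))
```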

Further development of the Unicode standard is associated with the addition of new planes, i.e. characters in the ranges 10000 - 1FFFF, 20000 - 2FFFF, etc., which are intended to include the scripts of dead languages that are not covered by the table above. A new format, UTF-16, was developed to encode these additional characters.
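UTF-16 represents each of these additional characters with two 16-bit units, known as a surrogate pair. Here is a rough sketch of the arithmetic; the function name `to_surrogate_pair` is made up, and the sample character was chosen only because it belongs to the script of an extinct language.

```python
def to_surrogate_pair(code: int) -> tuple[int, int]:
    """Split a code in the range 10000-10FFFF into two 16-bit UTF-16 units."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only codes 10000-10FFFF need a surrogate pair")
    offset = code - 0x10000            # 20 significant bits remain
    high = 0xD800 | (offset >> 10)     # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # lower 10 bits
    return high, low

# U+10330 is GOTHIC LETTER AHSA.
high, low = to_surrogate_pair(0x10330)
print(f"{high:04X} {low:04X}")   # D800 DF30
```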

So there are four main ways to encode Unicode characters as bytes:

UTF-8: 128 characters are encoded in one byte (ASCII format), 1920 characters in 2 bytes (Latin, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic characters), and 63488 characters in 3 bytes (Chinese, Japanese, etc.). In the original design, the remaining 2,147,418,112 codes (not yet used) could be encoded with 4, 5 or 6 bytes; the current definition of UTF-8 stops at 4 bytes, which is enough for all codes up to 10FFFF.

UCS-2: Each character is represented by 2 bytes. This encoding includes only the first 65,536 characters of Unicode.

UTF-16: An extension of UCS-2 that covers 1,114,112 Unicode codes. The first 65,536 characters are represented by 2 bytes, the rest by 4 bytes.

UCS-4: Each character is encoded in 4 bytes.
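To compare the sizes yourself, you can encode a few sample characters and count the bytes. In this sketch Python's codec names are used as stand-ins: `utf-16-be` coincides with UCS-2 for the first 65,536 codes, and `utf-32-be` plays the role of UCS-4. The sample characters are arbitrary.

```python
samples = {
    "Latin A": "A",
    "Cyrillic Я": "Я",
    "Chinese 中": "中",
    "Gothic U+10330": "\U00010330",
}
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# Number of bytes each character takes in each encoding.
sizes = {
    label: [len(char.encode(enc)) for enc in encodings]
    for label, char in samples.items()
}
for label, row in sizes.items():
    print(f"{label:16}", row)
```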

Dec Hex Symbol      Dec Hex Symbol
000 00  NUL         128 80  Ђ
001 01  SOH         129 81  Ѓ
002 02  STX         130 82  ‚
003 03  ETX         131 83  ѓ
004 04  EOT         132 84  „
005 05  ENQ         133 85  …
006 06  ACK         134 86  †
007 07  BEL         135 87  ‡
008 08  BS          136 88  €
009 09  TAB         137 89  ‰
010 0A  LF          138 8A  Љ
011 0B  VT          139 8B  ‹
012 0C  FF          140 8C  Њ
013 0D  CR          141 8D  Ќ
014 0E  SO          142 8E  Ћ
015 0F  SI          143 8F  Џ
016 10  DLE         144 90  ђ
017 11  DC1         145 91  ‘
018 12  DC2         146 92  ’
019 13  DC3         147 93  “
020 14  DC4         148 94  ”
021 15  NAK         149 95  •
022 16  SYN         150 96  en dash
023 17  ETB         151 97  em dash
024 18  CAN         152 98  (not used)
025 19  EM          153 99  ™
026 1A  SUB         154 9A  љ
027 1B  ESC         155 9B  ›
028 1C  FS          156 9C  њ
029 1D  GS          157 9D  ќ
030 1E  RS          158 9E  ћ
031 1F  US          159 9F  џ
032 20  SP (space)  160 A0  NBSP (non-breaking space)
033 21  !           161 A1  Ў
034 22  "           162 A2  ў
035 23  #           163 A3  Ј
036 24  $           164 A4  ¤
037 25  %           165 A5  Ґ
038 26  &           166 A6  ¦
039 27  '           167 A7  §
040 28  (           168 A8  Ё
041 29  )           169 A9  ©
042 2A  *           170 AA  Є
043 2B  +           171 AB  «
044 2C  ,           172 AC  ¬
045 2D  -           173 AD  SHY (soft hyphen)
046 2E  .           174 AE  ®
047 2F  /           175 AF  Ї
048 30  0           176 B0  °
049 31  1           177 B1  ±
050 32  2           178 B2  І
051 33  3           179 B3  і
052 34  4           180 B4  ґ
053 35  5           181 B5  µ
054 36  6           182 B6  ¶
055 37  7           183 B7  ·
056 38  8           184 B8  ё
057 39  9           185 B9  №
058 3A  :           186 BA  є
059 3B  ;           187 BB  »
060 3C  <           188 BC  ј
061 3D  =           189 BD  Ѕ
062 3E  >           190 BE  ѕ
063 3F  ?           191 BF  ї
064 40  @           192 C0  А
065 41  A           193 C1  Б
066 42  B           194 C2  В
067 43  C           195 C3  Г
068 44  D           196 C4  Д
069 45  E           197 C5  Е
070 46  F           198 C6  Ж
071 47  G           199 C7  З
072 48  H           200 C8  И
073 49  I           201 C9  Й
074 4A  J           202 CA  К
075 4B  K           203 CB  Л
076 4C  L           204 CC  М
077 4D  M           205 CD  Н
078 4E  N           206 CE  О
079 4F  O           207 CF  П
080 50  P           208 D0  Р
081 51  Q           209 D1  С
082 52  R           210 D2  Т
083 53  S           211 D3  У
084 54  T           212 D4  Ф
085 55  U           213 D5  Х
086 56  V           214 D6  Ц
087 57  W           215 D7  Ч
088 58  X           216 D8  Ш
089 59  Y           217 D9  Щ
090 5A  Z           218 DA  Ъ
091 5B  [           219 DB  Ы
092 5C  \           220 DC  Ь
093 5D  ]           221 DD  Э
094 5E  ^           222 DE  Ю
095 5F  _           223 DF  Я
096 60  `           224 E0  а
097 61  a           225 E1  б
098 62  b           226 E2  в
099 63  c           227 E3  г
100 64  d           228 E4  д
101 65  e           229 E5  е
102 66  f           230 E6  ж
103 67  g           231 E7  з
104 68  h           232 E8  и
105 69  i           233 E9  й
106 6A  j           234 EA  к
107 6B  k           235 EB  л
108 6C  l           236 EC  м
109 6D  m           237 ED  н
110 6E  n           238 EE  о
111 6F  o           239 EF  п
112 70  p           240 F0  р
113 71  q           241 F1  с
114 72  r           242 F2  т
115 73  s           243 F3  у
116 74  t           244 F4  ф
117 75  u           245 F5  х
118 76  v           246 F6  ц
119 77  w           247 F7  ч
120 78  x           248 F8  ш
121 79  y           249 F9  щ
122 7A  z           250 FA  ъ
123 7B  {           251 FB  ы
124 7C  |           252 FC  ь
125 7D  }           253 FD  э
126 7E  ~           254 FE  ю
127 7F  DEL         255 FF  я

ASCII and Windows 1251 character code table.
Description of special (control) characters

It should be noted that the control characters of the ASCII table were originally used for data exchange via teletypewriter, for data entry from punched tape, and for simple control of external devices.
Currently, most of the ASCII control characters no longer serve this purpose and can be used for other purposes.
Code      Description
NUL, 00   Null, empty character
SOH, 01   Start Of Heading
STX, 02   Start of TeXt
ETX, 03   End of TeXt
EOT, 04   End of Transmission
ENQ, 05   Enquiry, request for confirmation
ACK, 06   Acknowledgment, confirmation
BEL, 07   Bell, audible signal
BS, 08    Backspace, go back one character
TAB, 09   Tab, horizontal tabulation
LF, 0A    Line Feed. Nowadays in most programming languages it is denoted as \n
VT, 0B    Vertical Tab, vertical tabulation
FF, 0C    Form Feed, page feed, new page
CR, 0D    Carriage Return. Nowadays in most programming languages it is denoted as \r
SO, 0E    Shift Out, change the color of the ink ribbon in the printing device
SI, 0F    Shift In, return the color of the ink ribbon in the printing device back
DLE, 10   Data Link Escape, switching the channel to data transmission
DC1, 11   Device Control 1, device control character
DC2, 12   Device Control 2, device control character
DC3, 13   Device Control 3, device control character
DC4, 14   Device Control 4, device control character
NAK, 15   Negative Acknowledgment, no confirmation
SYN, 16   Synchronization character
ETB, 17   End of Transmission Block
CAN, 18   Cancel, cancel what was previously transmitted
EM, 19    End of Medium
SUB, 1A   Substitute. Placed in place of a character whose value was lost or corrupted during transmission
ESC, 1B   Escape, start of a control sequence
FS, 1C    File Separator
GS, 1D    Group Separator
RS, 1E    Record Separator
US, 1F    Unit Separator
DEL, 7F   Delete, erase the last character
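Several of these control characters still have their own escape sequences in programming languages. As a small illustration, here are the ones Python offers; the dictionary name `ESCAPES` is just for this example.

```python
# Escape sequences Python provides for some of the control characters above.
ESCAPES = {
    "NUL": "\0",
    "BEL": "\a",
    "BS": "\b",
    "TAB": "\t",
    "LF": "\n",
    "VT": "\v",
    "FF": "\f",
    "CR": "\r",
    "ESC": "\x1b",   # no dedicated escape letter, so the hex form is used
    "DEL": "\x7f",
}

# Turn each character back into its code to compare with the table.
codes = {name: ord(char) for name, char in ESCAPES.items()}
for name, code in codes.items():
    print(f"{name}, {code:02X}")
```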