ASCII binary codes. Encoding text information


For a computer, encoding means converting information into a form that is convenient to transmit, store, or process automatically. Various code tables are used for this purpose. ASCII, the first such system, was developed in the United States for working with English text and subsequently spread throughout the world. Its description, features, properties, and later use are the subject of the article below.

Display and storage of information in a computer

Symbols on a computer monitor or on a mobile digital gadget are formed from a set of vector glyph shapes for the various characters and a code that selects, from that set, the symbol to be inserted at a given position. The code is a sequence of bits: each character must correspond uniquely to a set of zeros and ones appearing in a specific, unique order.

How it all began

Historically, the first computers were English-language. Encoding symbolic information on them required only 7 bits of memory, although 1 byte of 8 bits was allocated for the purpose. The number of characters the computer understood in this case was 128. These characters included the English alphabet, punctuation marks, digits, and some special characters. The English-language seven-bit encoding with its corresponding table (code page), developed in 1963, was named the American Standard Code for Information Interchange; the abbreviation "ASCII" was, and still is, used to refer to it.
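As a minimal illustration (a Python sketch, not part of the standard itself), every ASCII character fits into 7 bits:

    # Every ASCII character has a code below 128, so 7 bits are enough
    for ch in ("A", "z", "5", " "):
        code = ord(ch)
        print(ch, code, format(code, "07b"))   # e.g. A 65 1000001
    assert all(ord(c) < 128 for c in "Hello, ASCII!")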

Transition to multilingualism

Over time, computers became widely used in non-English-speaking countries, and encodings that could represent national languages became necessary. Rather than reinvent the wheel, ASCII was taken as the basis, and in the new edition the table was expanded considerably: using the 8th bit made it possible to encode 256 characters.

Description

The ASCII table is divided into 2 parts, of which only the first half is considered a generally accepted international standard. The table includes:

  • Characters with serial numbers from 0 to 31, encoded by the sequences 00000000 to 00011111. They are reserved for control characters, which manage the process of displaying text on a screen or printer, sound the bell signal, and so on.
  • Characters with serial numbers from 32 to 127, encoded by the sequences 00100000 to 01111111, form the standard part of the table. These include the space (number 32), the letters of the Latin alphabet (lowercase and uppercase), the ten decimal digits 0 to 9, punctuation marks, brackets of various styles, and other symbols.
  • Characters with serial numbers from 128 to 255, encoded by the sequences 10000000 to 11111111. These cover letters of national alphabets other than Latin; it is this alternative part of the ASCII table that is used to represent Russian characters in computer form.

Some properties

A notable feature of the ASCII encoding is that the lowercase and uppercase forms of the letters "A" to "Z" differ by only one bit. This greatly simplifies case conversion, as well as checking whether a code belongs to a given range of values. In addition, every letter in ASCII is represented by its own sequence number in the alphabet, written as 5 binary digits preceded by 011 for lowercase letters and 010 for uppercase letters.
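A short Python sketch (illustrative only) shows this single-bit difference and the 010/011 prefixes:

    # Upper- and lowercase Latin letters differ only in one bit (0x20)
    print(format(ord("A"), "08b"))   # 01000001 -> prefix 010 + 00001 (letter number 1)
    print(format(ord("a"), "08b"))   # 01100001 -> prefix 011 + 00001
    print(chr(ord("A") | 0x20))      # 'a' (setting the bit gives lowercase)
    print(chr(ord("a") & ~0x20))     # 'A' (clearing the bit gives uppercase)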

Another feature of the ASCII encoding is the representation of the 10 digits "0" to "9". In binary, their codes begin with 0011 and end with the value of the digit itself. For example, 0101 is the binary equivalent of the decimal number five, so the character "5" is written as 0011 0101. It follows that a binary-coded decimal (BCD) number can easily be converted to an ASCII string by prepending the bit sequence 0011 to each nibble.
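This nibble rule can be checked with a small Python sketch (illustrative only):

    # An ASCII digit is 0011 in the high nibble plus the digit's value in the low nibble
    for d in range(10):
        code = 0x30 | d                  # prepend 0011 to the 4-bit value
        print(d, format(code, "08b"), chr(code))
    print(chr(0b00110101))               # -> '5'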

"Unicode"

As is well known, thousands of characters are needed to display texts in the languages of Southeast Asia. Such a number cannot possibly be described in one byte of information, so even extended versions of ASCII could no longer satisfy the growing needs of users from different countries.

Thus, the need arose to create a universal text encoding, and its development, in collaboration with many leaders of the global IT industry, was undertaken by the Unicode Consortium. Its specialists created the UTF-32 scheme, in which 32 bits, i.e. 4 bytes of information, were allocated to encode 1 character. The main disadvantage was a sharp, fourfold increase in the amount of memory required, which caused many problems.

At the same time, for most countries whose official languages belong to the Indo-European group, a repertoire of 2^32 characters is more than excessive.

As a result of further work by the Unicode Consortium, the UTF-16 encoding appeared. It became the option that suited everyone, both in the amount of memory required and in the number of characters it could encode. That is why UTF-16 was widely adopted as a default; it reserves 2 bytes for one character.

Even this fairly advanced and successful version of Unicode had drawbacks: after moving from extended ASCII to UTF-16, the size of a document doubled.

For this reason, the variable-length encoding UTF-8 came into use. In it, each character of the source text is encoded as a sequence of 1 to 4 bytes (the original design allowed up to 6).
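A quick Python sketch (illustrative only) compares how much memory the same short Latin string occupies in these encodings:

    text = "Hello"
    print(len(text.encode("utf-32-le")))   # 20 bytes: 4 per character
    print(len(text.encode("utf-16-le")))   # 10 bytes: 2 per character (for BMP characters)
    print(len(text.encode("utf-8")))       # 5 bytes: 1 per character for plain Latin text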

Relation to the American Standard Code for Information Interchange

In variable-length UTF-8, all Latin characters are encoded in 1 byte, just as in the ASCII encoding system.

A special feature of UTF-8 is that if a text uses only Latin characters, even programs that do not understand Unicode can still read it. In other words, the basic ASCII part of the encoding simply becomes part of the new variable-length UTF. Cyrillic characters in UTF-8 occupy 2 bytes and, for example, Georgian characters 3 bytes. With the creation of UTF-16 and UTF-8, the main problem of establishing a single code space for fonts was solved. Since then, font manufacturers only need to fill the table with vector forms of characters according to their needs.
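These byte counts can be verified with a small Python sketch (assuming the standard codecs):

    for ch in ("A", "я", "ა"):            # Latin, Cyrillic, Georgian
        b = ch.encode("utf-8")
        print(ch, len(b), b.hex(" "))
    # A 1 41  /  я 2 d1 8f  /  ა 3 e1 83 90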

Different operating systems prefer different encodings. To read and edit texts typed in another encoding, Russian text-conversion programs are used. Some text editors contain built-in transcoders and can read text regardless of its encoding.

Now you know how many characters there are in the ASCII encoding, and how and why it was developed. Of course, today the Unicode standard is the most widespread in the world, but we should not forget that it is based on ASCII, so the contribution of its developers to the IT field deserves to be appreciated.

8-bit encodings: ASCII, KOI-8R and CP1251

The first encoding tables, created in the United States, did not use the eighth bit in a byte. Text was represented as a sequence of bytes, but the eighth bit was not taken into account (it was used for service purposes such as parity).

This table became the generally accepted standard known as ASCII (American Standard Code for Information Interchange). The first 32 characters of the ASCII table (00 to 1F hex) are non-printing characters, designed to control a printing device and the like. The rest, from 20 to 7F, are regular (printable) characters.

Table 1 - ASCII encoding

Dec Hex Oct Char Description
0 0 000 null
1 1 001 start of heading
2 2 002 start of text
3 3 003 end of text
4 4 004 end of transmission
5 5 005 inquiry
6 6 006 acknowledge
7 7 007 bell
8 8 010 backspace
9 9 011 horizontal tab
10 A 012 new line
11 B 013 vertical tab
12 C 014 new page
13 D 015 carriage return
14 E 016 shift out
15 F 017 shift in
16 10 020 data link escape
17 11 021 device control 1
18 12 022 device control 2
19 13 023 device control 3
20 14 024 device control 4
21 15 025 negative acknowledge
22 16 026 synchronous idle
23 17 027 end of trans. block
24 18 030 cancel
25 19 031 end of medium
26 1A 032 substitute
27 1B 033 escape
28 1C 034 file separator
29 1D 035 group separator
30 1E 036 record separator
31 1F 037 unit separator
32 20 040 space
33 21 041 !
34 22 042 "
35 23 043 #
36 24 044 $
37 25 045 %
38 26 046 &
39 27 047 '
40 28 050 (
41 29 051 )
42 2A 052 *
43 2B 053 +
44 2C 054 ,
45 2D 055 -
46 2E 056 .
47 2F 057 /
48 30 060 0
49 31 061 1
50 32 062 2
51 33 063 3
52 34 064 4
53 35 065 5
54 36 066 6
55 37 067 7
56 38 070 8
57 39 071 9
58 3A 072 :
59 3B 073 ;
60 3C 074 <
61 3D 075 =
62 3E 076 >
63 3F 077 ?
Dec Hex Oct Char
64 40 100 @
65 41 101 A
66 42 102 B
67 43 103 C
68 44 104 D
69 45 105 E
70 46 106 F
71 47 107 G
72 48 110 H
73 49 111 I
74 4A 112 J
75 4B 113 K
76 4C 114 L
77 4D 115 M
78 4E 116 N
79 4F 117 O
80 50 120 P
81 51 121 Q
82 52 122 R
83 53 123 S
84 54 124 T
85 55 125 U
86 56 126 V
87 57 127 W
88 58 130 X
89 59 131 Y
90 5A 132 Z
91 5B 133 [
92 5C 134 \
93 5D 135 ]
94 5E 136 ^
95 5F 137 _
96 60 140 `
97 61 141 a
98 62 142 b
99 63 143 c
100 64 144 d
101 65 145 e
102 66 146 f
103 67 147 g
104 68 150 h
105 69 151 i
106 6A 152 j
107 6B 153 k
108 6C 154 l
109 6D 155 m
110 6E 156 n
111 6F 157 o
112 70 160 p
113 71 161 q
114 72 162 r
115 73 163 s
116 74 164 t
117 75 165 u
118 76 166 v
119 77 167 w
120 78 170 x
121 79 171 y
122 7A 172 z
123 7B 173 {
124 7C 174 |
125 7D 175 }
126 7E 176 ~
127 7F 177 DEL

As is easy to see, this encoding contains only the letters used in English, plus arithmetic and other service symbols. There are no Russian letters, nor even the special Latin letters needed for German or French. This is easy to explain: the encoding was developed specifically as an American standard. As computers came into use throughout the world, other characters needed to be encoded.

To do this, it was decided to use the eighth bit of each byte, which made 128 more values available (from 80 to FF hex) for encoding characters. The first of the eight-bit tables, "extended ASCII" (Extended ASCII), included various variants of Latin characters used in some Western European languages, as well as other additional symbols, including pseudographics.

Pseudographic characters make it possible, using text characters only, to provide some semblance of graphics. The FAR Manager file manager, for example, draws its interface with pseudographics.

There were no Russian letters in the Extended ASCII table. Russia (formerly the USSR) and other countries created their own encodings that made it possible to represent specific "national" characters in 8-bit text files: Latin letters of the Polish and Czech languages, Cyrillic (including Russian letters), and other alphabets.

In all the encodings that became widespread, the first 127 characters (that is, byte values with the eighth bit equal to 0) are the same as in ASCII. So an ASCII file can be read in any of these encodings; English letters are represented identically.
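A short Python check (illustrative only) confirms that the lower half of the table reads identically under ASCII and the common Russian code pages:

    sample = bytes(range(0x20, 0x7F))       # the printable ASCII range
    assert sample.decode("ascii") == sample.decode("koi8-r") == sample.decode("cp1251")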

The ISO (International Organization for Standardization) adopted the ISO 8859 group of standards, which defines 8-bit encodings for different groups of languages. For example, ISO 8859-1 is an Extended ASCII table for the USA and Western Europe, while ISO 8859-5 is a table for the Cyrillic alphabet (including Russian).

However, for historical reasons, the ISO 8859-5 encoding did not take root. In practice, the following encodings are used for the Russian language (a short byte-level comparison follows the list):

- Code Page 866 (CP866), also known as the "DOS" or "alternative GOST" encoding. Widely used until the mid-90s; now used to a limited extent and practically never for distributing texts on the Internet.
- KOI-8. Developed in the 70s and 80s, it is the generally accepted standard for transmitting mail messages on the Russian Internet. It is widely used in Unix-family operating systems, including Linux. The version of KOI-8 designed for Russian is called KOI-8R; there are versions for other Cyrillic languages (for example, KOI8-U is the version for the Ukrainian language).
- Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to support the Russian language in Windows.
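As a quick illustration (a Python sketch using the standard library codecs), the same Russian word receives completely different byte values in each of these encodings:

    word = "Привет"
    for codec in ("cp866", "koi8-r", "cp1251"):
        print(codec, word.encode(codec).hex(" "))
    # cp866  -> 8f e0 a8 a2 a5 e2
    # koi8-r -> f0 d2 c9 d7 c5 d4
    # cp1251 -> cf f0 e8 e2 e5 f2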

The main advantage of CP866 was that it kept the pseudographic characters in the same places as in Extended ASCII, so foreign-made text-mode programs, such as the famous Norton Commander, could work without changes. CP866 is still used for Windows programs that run in text windows or in full-screen text mode, including FAR Manager.

Texts in CP866 have become quite rare in recent years (though it is used to encode Russian file names in Windows). We will therefore dwell in more detail on the two other encodings, KOI-8R and CP1251.



As you can see, in the CP1251 encoding table the Russian letters are arranged in alphabetical order (except, however, for the letter Ё). This arrangement makes it very easy for computer programs to sort text alphabetically.

In KOI-8R, by contrast, the order of the Russian letters appears random. But in fact it is not.

In many older programs, the 8th bit was lost when text was processed or transmitted. (Such programs are practically "extinct" now, but in the late 80s and early 90s they were widespread.) To get a 7-bit value from an 8-bit one, it is enough to subtract 8 from the most significant hexadecimal digit; for example, E1 becomes 61.

Now compare KOI-8R with the ASCII table (Table 1). You will find that the Russian letters are placed in strict correspondence with Latin ones: if the eighth bit disappears, lowercase Russian letters turn into uppercase Latin letters, and uppercase Russian letters turn into lowercase Latin letters. So E1 in KOI-8 is the Russian "А", while 61 in ASCII is the Latin "a".

Thus, KOI-8 keeps Russian text readable when the 8th bit is lost: "Привет всем" ("Hello everyone") becomes "pRIWET WSEM".
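The effect is easy to reproduce with a small Python sketch (purely illustrative):

    text = "Привет всем"
    koi8 = text.encode("koi8-r")                 # Cyrillic letters land in the upper half
    stripped = bytes(b & 0x7F for b in koi8)     # drop the 8th bit
    print(stripped.decode("ascii"))              # -> pRIWET WSEM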

Lately, both the alphabetical ordering of characters in the encoding table and readability after loss of the 8th bit have lost their decisive importance. The eighth bit is not lost in modern computers during transmission or processing, and alphabetical sorting is done with the encoding taken into account rather than by simply comparing codes. (Incidentally, the CP1251 codes are not arranged completely alphabetically either: the letter Ё is out of place.)

Because there are two common encodings, when working with the Internet (mail, browsing Web sites) you can sometimes see a meaningless jumble of letters instead of Russian text, for example "Я СБЮФЕМХЕЛ" ("I AM SBYUFEMHEL" in transliteration). These are just the words "с уважением" ("with respect"), but they were encoded in CP1251 and the computer decoded the text using the KOI-8 table. If, on the contrary, the same words were encoded in KOI-8 and the computer decoded them according to the CP1251 table, the result would be "У ХЧБЦЕОЙЕН" ("U HCHBTSEOYEN").
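This kind of mix-up is also easy to reproduce in a Python sketch (illustrative only):

    original = "с уважением"                           # "with respect"
    garbled = original.encode("cp1251").decode("koi8-r")
    print(garbled)                                     # -> Я СБЮФЕМХЕЛ
    print(garbled.encode("koi8-r").decode("cp1251"))   # decoding back restores the original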

Sometimes it happens that a computer decodes Russian letters using a table not intended for the Russian language at all. Then, instead of Russian letters, a meaningless jumble of symbols appears (for example, Latin letters of Eastern European languages); such text is often called "krokozyabry" (mojibake).

In most cases, modern programs determine the encoding of Internet documents (e-mail messages and Web pages) on their own. But sometimes they "misfire", and then you may see strange sequences of Russian letters or "krokozyabry". As a rule, in such a situation it is enough to select the encoding manually in the program menu to display the real text on the screen.

Information from the page http://open-office.edusite.ru/TextProcessor/p5aa1.html was used for this article.


According to the International Telecommunication Union, in 2016 three and a half billion people used the Internet with some regularity. Most of them never even think about the fact that any messages they send from a PC or mobile gadget, and any texts displayed on all kinds of monitors, are actually combinations of 0s and 1s. This representation of information is called encoding; it enables and greatly simplifies its storage, processing, and transmission. The American ASCII encoding, developed in 1963, is the subject of this article.

Presenting information on a computer

From the point of view of an electronic computer, text is a set of individual characters. These include not only letters (including capitals) but also punctuation marks and digits. In addition, special characters such as "=", "&", "(" and the space are used.

The set of characters used to make up a text is called an alphabet, and their number is called its cardinality (denoted N). It is determined by the expression N = 2^b, where b is the number of bits, or the information weight, of a particular symbol.

It has been proven that an alphabet with a capacity of 256 characters can represent all the necessary characters.

Since 256 represents the 8th power of two, the weight of each character is 8 bits.

A unit of 8 bits is called a byte, so it is customary to say that any character of text stored on a computer occupies one byte of memory.
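A tiny worked example of this arithmetic in Python (illustrative only):

    import math
    alphabet_size = 256
    print(math.log2(alphabet_size))          # b = log2(N) = 8 bits per character
    text = "ASCII"
    print(len(text.encode("ascii")))         # 5 characters -> 5 bytes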

How is coding done?

Texts are entered into the memory of a personal computer from keyboard keys on which digits, letters, punctuation marks and other symbols are printed. In RAM they are stored in binary code: each character is associated with a decimal code familiar to humans, from 0 to 255, which corresponds to a binary code from 00000000 to 11111111.

Byte-by-byte character encoding allows the processor performing text processing to access each character individually, and 256 characters are quite enough to represent any symbolic information.
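For instance (a Python sketch), every character of an encoded string is one directly addressable byte:

    data = "Hi!".encode("ascii")              # b'Hi!'
    print(data[0], data[1], data[2])          # 72 105 33 - one byte per character
    print(format(data[0], "08b"))             # 01001000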

ASCII character encoding

This abbreviation stands for the American Standard Code for Information Interchange.

Even at the dawn of computerization it became obvious that many different ways of encoding information could be devised, but to transfer information from one computer to another a unified standard was needed. So, in 1963, the ASCII encoding table appeared in the USA. In it, every symbol of the computer alphabet is associated with its serial number in binary representation. ASCII was originally used only in the United States and later became an international standard for PCs.

ASCII codes are divided into 2 parts. Only the first half of the table is considered the international standard; it includes characters with serial numbers from 0 (coded 00000000) to 127 (coded 01111111).

ASCII binary code | Characters

0000 0000 - 0001 1111 | Characters with numbers from 0 to 31 are called control characters. Their function is to "manage" the process of displaying text on a monitor or printing device, to give a sound signal, and so on.

0010 0000 - 0111 1111 | Characters with numbers from 32 to 127 (the standard part of the table): uppercase and lowercase letters of the Latin alphabet, the ten decimal digits, punctuation marks, as well as various brackets, commercial and other symbols. Character 32 is the space.

1000 0000 - 1111 1111 | Characters with numbers from 128 to 255 (the alternative part of the table, or code page) can exist in different variants, each with its own number. A code page is used to specify national alphabets other than Latin; in particular, it is with its help that ASCII encoding of Russian characters is carried out.

In the table, uppercase and lowercase letters follow one another in alphabetical order, and the digits in ascending order of value. The same principle holds for the Russian alphabet in the corresponding code pages.

Control characters

The ASCII table was originally created for receiving and transmitting information over a device that has long since fallen out of use, the teletype. Because of this, non-printable characters were included in the character set and used as commands to control that device. Similar commands were used in pre-computer messaging methods such as Morse code.

The most common of these teletype characters is NUL (code 00). It is still used today in many programming languages (notably C) to mark the end of a string.
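As an illustration (a Python sketch that uses ctypes to read bytes the way a C string is read), a NUL byte cuts the string short:

    import ctypes
    raw = b"abc\x00def"
    print(ctypes.c_char_p(raw).value)        # -> b'abc': reading stops at the NUL byte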

Where is ASCII encoding used?

The American standard code is needed not only for entering text from the keyboard; it is also used in graphics. In particular, in the ASCII Art Maker program, images of various formats are made up of a spectrum of ASCII characters.

There are two kinds of such products: those that act as graphic editors, converting images into text, and those that convert "drawings" into ASCII graphics. The famous emoticon is a prime example of a picture composed of encoded characters.

ASCII can also be used when creating an HTML document: you can enter a certain numeric code, and when the page is viewed, the symbol corresponding to that code appears on the screen.
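For example, numeric character references of this kind can be expanded with a short Python sketch (an illustration of the same idea):

    import html
    print(html.unescape("&#65;&#83;&#67;&#73;&#73;"))   # -> ASCII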

ASCII is also necessary for creating multilingual websites, since characters that are not included in a specific national table are replaced with ASCII codes.

Some features

ASCII was originally designed to encode text using 7 bits (the eighth was left unused), but today it is handled as 8 bits.

Letters in the corresponding upper and lower columns of the table differ from each other by only a single bit, which significantly simplifies case checks and conversion.

Using ASCII in Microsoft Office

If necessary, this type of text encoding can be used in Microsoft text editors such as Notepad and Office Word. However, some functions may be unavailable when typing in this mode: for example, you cannot make text bold, because ASCII encoding preserves only the meaning of the information and ignores its appearance and form.

Standardization

The ISO organization has adopted ISO 8859 standards. This group defines eight-bit encodings for different language groups. Specifically, ISO 8859-1 is an Extended ASCII table for the United States and Western European countries. And ISO 8859-5 is a table used for the Cyrillic alphabet, including the Russian language.

For a number of historical reasons, the ISO 8859-5 standard was used for a very short time.

For the Russian language, the encodings actually used at the moment are:

  • CP866 (Code Page 866), or "DOS", often called the alternative GOST encoding. It was actively used until the mid-90s of the last century; at the moment it is practically not used.
  • KOI-8. The encoding was developed in the 1970s and 80s, and is currently the generally accepted standard for email messages on the RuNet. It is widely used in Unix operating systems, including Linux. The “Russian” version of KOI-8 is called KOI-8R. In addition, there are versions for other Cyrillic languages, such as Ukrainian.
  • Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to provide support for the Russian language in the Windows environment.

The main advantage of the first of these, CP866, was the preservation of pseudographic characters in the same positions as in Extended ASCII, which made it possible to run foreign-made text-mode programs, such as the famous Norton Commander, without modification. Currently, CP866 is used for programs developed for Windows that run in full-screen text mode or in text windows, including FAR Manager.
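A quick Python sketch (illustrative only) shows some of the box-drawing (pseudographic) characters sitting in the upper half of CP866:

    # CP866 places box-drawing characters in its upper half
    print(bytes(range(0xC0, 0xD0)).decode("cp866"))   # -> └┴┬├─┼╞╟╚╔╩╦╠═╬╧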

Computer texts written in CP866 encoding are quite rare these days, but it is the one that is used for Russian file names in Windows.

"Unicode"

At the moment, this encoding is the most widely used. Unicode codes are divided into areas. The first (U+0000 to U+007F) contains the ASCII characters. This is followed by areas for the characters of various national scripts, as well as punctuation marks and technical symbols. In addition, some Unicode codes are reserved in case new characters need to be included in the future.

Now you know that in ASCII, each character is represented as a combination of 8 zeros and ones. To non-specialists, this information may seem unnecessary and uninteresting, but don’t you want to know what’s going on “in the brains” of your PC?!

Unicode is a character encoding standard. Simply put, it is a table of correspondence between text characters (letters, punctuation marks, and other symbols) and binary codes. A computer understands only sequences of zeros and ones, so for it to know exactly what to display on the screen, each symbol must be assigned its own unique number. In the eighties, characters were encoded in one byte, that is, eight bits (each bit is a 0 or a 1), so one table (also called an encoding or character set) could hold only 256 characters, which may not be enough even for a single language. This is why there were many different encodings, and the confusion between them often led to strange gibberish appearing on the screen instead of readable text. A single standard was required, and Unicode became that standard. The most widely used encoding form is UTF-8 (Unicode Transformation Format), which uses 1 to 4 bytes to represent a character.

Symbols

Characters in the Unicode tables are numbered with hexadecimal numbers. For example, the Cyrillic capital letter М is designated U+041C, which means it stands at the intersection of row 041 and column C. You can simply copy it and then paste it somewhere. To avoid scrolling through a multi-kilometre list, use the search. When you go to a symbol's page, you will see its Unicode number and how it is drawn in different fonts. You can enter the sign itself into the search bar, even if a square is drawn instead of it, at least to find out what it was. Also, on this site there are special (and random) sets of similar icons, collected from different sections, for ease of use.
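The same code point can be checked programmatically; the following Python sketch is only an illustration and assumes the source file is saved in UTF-8:

    ch = "М"                          # Cyrillic capital letter EM
    print(f"U+{ord(ch):04X}")         # -> U+041C
    print("\u041c")                   # -> М (the same character written by its code)
    print(ch.encode("utf-8"))         # -> b'\xd0\x9c' (two bytes in UTF-8)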

The Unicode standard is international. It includes characters from almost all the scripts of the world, including those no longer in use: Egyptian hieroglyphs, Germanic runes, Mayan writing, cuneiform, and the alphabets of ancient states. Designations of weights and measures, musical notation, and mathematical symbols are also represented.

The Unicode Consortium itself does not invent new characters; icons that find real use in society are added to the tables. For example, the ruble sign was actively used for six years before it was added to Unicode. Emoji pictograms (emoticons) were also widely used in Japan before they were included in the encoding. Trademarks and company logos, however, are not added as a matter of principle, even ones as common as the Apple logo or the Windows flag. To date, about 120 thousand characters are encoded in version 8.0.

Character overlay

The BS (backspace) character allows the printer to print one character on top of another. ASCII provided for adding diacritics to letters in this way, for example:

  • a BS ' → á
  • a BS ` → à
  • a BS ^ → â
  • o BS / → ø
  • c BS , → ç
  • n BS ~ → ñ

Note: in old fonts, the apostrophe ' was drawn slanted to the left, and the tilde ~ was shifted up, so they fit the roles of an acute accent and a tilde above the letter.

If the same symbol is superimposed on itself, the effect is bold type, and if an underscore is superimposed on a character, underlined text is obtained.

  • a BS a → a
  • a BS _ → a

Note: this is used, for example, in the man help system.
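As a sketch (illustrative only), such overstrike sequences can be stripped in Python in much the same way as the col -b utility does:

    import re

    def strip_overstrike(s: str) -> str:
        # remove "character + backspace" pairs left over from overstriking
        return re.sub(".\x08", "", s)

    print(strip_overstrike("H\bHe\bel\bll\blo\bo"))   # -> Hello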

National ASCII variants

The ISO 646 (ECMA-6) standard provides for the possibility of placing national symbols in place of @ [ \ ] ^ ` { | } ~. In addition, £ may be placed in place of #, and ¤ in place of $. This scheme is well suited to European languages, where only a few extra characters are needed. The version of ASCII without national characters is called US-ASCII, or the "International Reference Version".

Subsequently, it turned out to be more convenient to use 8-bit encodings (code pages), in which the lower half of the code table (0-127) is occupied by US-ASCII characters and the upper half (128-255) by additional characters, including a set of national characters. Thus, before the widespread adoption of Unicode, the upper half of the ASCII table was heavily used to represent localized characters, i.e. letters of the local language. The lack of a unified standard for placing Cyrillic characters in the ASCII table caused many problems with encodings (KOI-8, Windows-1251 and others); other languages with non-Latin scripts also suffered from having several different encodings.

   .0    .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F
0. NUL   SOM  EOA  EOM  EQT  WRU  RU   BELL BKSP HT   LF   VT   FF   CR   SO   SI
1. DC0   DC1  DC2  DC3  DC4  ERR  SYNC LEM  S0   S1   S2   S3   S4   S5   S6   S7
2. BLANK !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3. 0     1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4. @     A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5. P     Q    R    S    T    U    V    W    X    Y    Z    [    \    ]
6.       a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7. p     q    r    s    t    u    v    w    x    y    z                   ESC  DEL

On computers where the minimum addressable unit of memory was a 36-bit word, 6-bit characters were initially used (1 word = 6 characters). After the transition to ASCII, such computers began to pack either 5 seven-bit characters into one word (with 1 bit left over) or 4 nine-bit characters.

ASCII codes are also used in programming to determine which key has been pressed on a standard QWERTY keyboard.
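A hypothetical console sketch in Python (the prompt text and variable names are illustrative, not part of any standard API) shows how a typed key maps to its ASCII code:

    ch = input("Press a key and Enter: ")[:1]
    if ch and ord(ch) < 128:
        print(f"Key {ch!r} has ASCII code {ord(ch)} (hex {ord(ch):02X})")
    else:
        print("Not a 7-bit ASCII character")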






