Latin character codes. Encoding text information


According to the International Telecommunication Union, in 2016, three and a half billion people used the Internet with some regularity. Most of them don't even think about the fact that any messages they send via PC or mobile gadgets, as well as texts that are displayed on all kinds of monitors, are actually combinations of 0 and 1. This representation of information is called encoding. It ensures and greatly facilitates its storage, processing and transmission. In 1963, the American ASCII encoding was developed, which is the subject of this article.

Presenting information on a computer

From the point of view of any electronic computer, text is a set of individual characters. These include not only letters, including capital ones, but also punctuation marks and numbers. In addition, special characters “=”, “&”, “(” and spaces are used.

The set of characters that make up the text is called the alphabet, and their number is called cardinality (denoted as N). To determine it, the expression N = 2^b is used, where b is the number of bits or the information weight of a particular symbol.

In practice, an alphabet of 256 characters is enough to hold the letters, digits, punctuation marks and control codes of one or two languages, which is why it became the basis for single-byte encodings.

Since 256 represents the 8th power of two, the weight of each character is 8 bits.

A group of 8 bits is called 1 byte, so in such single-byte encodings every character of text stored on a computer takes up one byte of memory.
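The relation N = 2^b can be checked with a few lines of Python. This is only a minimal sketch; the function names `alphabet_size` and `bits_per_character` are made up for this illustration.

```python
def alphabet_size(bits: int) -> int:
    """Cardinality N of an alphabet whose characters each weigh `bits` bits."""
    return 2 ** bits


def bits_per_character(size: int) -> int:
    """Smallest b such that 2**b >= size (at least 1 bit)."""
    return max(1, (size - 1).bit_length())


print(alphabet_size(8))         # 256
print(bits_per_character(256))  # 8
print(bits_per_character(128))  # 7
```

With b = 8 the alphabet holds 256 characters, and with b = 7 it holds 128, which is the size of the standard part of the ASCII table.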

How is coding done?

Text is entered into a personal computer's memory from the keyboard, whose keys are labelled with numbers, letters, punctuation marks and other symbols. The characters are passed to RAM in binary code: each character is associated with a decimal code familiar to humans, from 0 to 255, which corresponds to a binary code from 00000000 to 11111111.

Byte-by-byte character encoding allows the processor performing text processing to access each character individually. At the same time, 256 characters are enough to represent the text of a single language together with digits and punctuation.
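The correspondence between a character, its decimal code and its 8-bit binary code can be shown with Python's built-in `ord` and `format`. This is an illustrative sketch; the helper name `char_codes` is an assumption of this example.

```python
def char_codes(text: str) -> list[tuple[str, int, str]]:
    """Return (character, decimal code, 8-bit binary code) for each character."""
    return [(ch, ord(ch), format(ord(ch), "08b")) for ch in text]


for ch, dec, binary in char_codes("Hi!"):
    print(ch, dec, binary)
# H 72 01001000
# i 105 01101001
# ! 33 00100001
```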

ASCII character encoding

The abbreviation stands for American Standard Code for Information Interchange.

Even at the dawn of computerization, it became obvious that it was possible to come up with a wide variety of ways to encode information. However, to transfer information from one computer to another, it was necessary to develop a unified standard. So, in 1963, the ASCII encoding table appeared in the USA. In it, any symbol of the computer alphabet is associated with its serial number in binary representation. ASCII was originally used only in the United States and later became an international standard for PCs.

ASCII codes are divided into 2 parts. Only the first half of this table is considered the international standard. It includes characters with serial numbers from 0 (coded as 00000000) to 127 (coded as 01111111).

Binary code (serial number N): symbols

0000 0000 - 0001 1111 (N from 0 to 31): control characters. Their function is to “manage” the process of displaying text on a monitor or printing device, giving a sound signal, etc.

0010 0000 - 0111 1111 (N from 32 to 127, the standard part of the table): uppercase and lowercase letters of the Latin alphabet, the ten decimal digits, punctuation marks, as well as various brackets, commercial and other symbols. Character 32 is the space, and character 127 (DEL) is also a control character.

1000 0000 - 1111 1111 (N from 128 to 255, the alternative part of the table, or code page): this part can have different variants, each of which has its own number. The code page is used to specify national alphabets that are different from Latin. In particular, it is with its help that Russian characters are encoded.

In the table, the uppercase letters follow one another in alphabetical order, as do the lowercase letters, and the digits are in ascending order. The same principle is largely kept for the Russian alphabet.
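The ordering of letters and digits is easy to confirm. The sketch below uses Python's `string` module; the helper `is_consecutive` is a name invented for this example.

```python
import string


def is_consecutive(codes: list[int]) -> bool:
    """True if every code is exactly one greater than the previous one."""
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))


upper_codes = [ord(c) for c in string.ascii_uppercase]  # A..Z
lower_codes = [ord(c) for c in string.ascii_lowercase]  # a..z
digit_codes = [ord(c) for c in string.digits]           # 0..9

print(upper_codes[0], upper_codes[-1])  # 65 90
print(lower_codes[0], lower_codes[-1])  # 97 122
print(digit_codes[0], digit_codes[-1])  # 48 57
print(is_consecutive(upper_codes), is_consecutive(digit_codes))  # True True
```

Because the codes are consecutive, comparing two ASCII letters of the same case by code gives the same result as comparing them alphabetically.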

Control characters

The ASCII encoding table was originally created for exchanging information via the teletype, a device that has long since fallen out of use. For this reason the character set includes non-printable characters that served as commands to control the device. Similar commands were used in pre-computer messaging methods such as Morse code.

The best-known control character is NUL (00). It is still used today in C and many related programming languages to mark the end of a string.

Where is ASCII encoding used?

The American standard code is needed not only for entering text from the keyboard. It is also used in graphics. In particular, in the ASCII Art Maker program, images in various formats are rendered as an arrangement of ASCII characters.

There are two types of such products: those that act as graphic editors by converting images to text, and those that convert “drawings” to ASCII graphics. The familiar emoticon is a simple example of a picture built from encoded characters.

ASCII can also be used when creating an HTML document. You can enter a certain sequence of characters (a character code), and when the page is viewed, the symbol that corresponds to this code appears on the screen.

ASCII is also needed when creating multilingual websites, since characters that are not included in a specific national table can be replaced with codes written in ASCII.
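In HTML, such codes are usually written as numeric character references of the form `&#NNN;`. The sketch below assumes that this is the mechanism the text refers to, and it uses only the Python standard library; `to_ascii_html` is a hypothetical helper name.

```python
import html


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


encoded = to_ascii_html("café")
print(encoded)                 # caf&#233;
print(html.unescape(encoded))  # café
```

The resulting page source contains only ASCII bytes, so it survives any single-byte code page, while the browser still shows the intended character.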

Some features

ASCII was originally a 7-bit code (the eighth bit of the byte was left unused), but today its characters are stored in 8-bit bytes.

An uppercase letter and the corresponding lowercase letter, which sit in the same row of neighbouring columns of the table, differ from each other in a single bit. This greatly simplifies checking and converting letter case.
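The single differing bit has the value 0x20 (decimal 32). The following sketch shows how flipping it switches the case of an ASCII letter; the names `CASE_BIT` and `toggle_case` are chosen for this example only.

```python
CASE_BIT = 0x20  # the one bit that separates 'A' (0x41) from 'a' (0x61)


def toggle_case(ch: str) -> str:
    """Flip the case of a single ASCII letter; leave other characters alone."""
    if ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ CASE_BIT)
    return ch


print(format(ord("A"), "07b"))  # 1000001
print(format(ord("a"), "07b"))  # 1100001
print("".join(toggle_case(c) for c in "Hello, World 42"))  # hELLO, wORLD 42
```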

Application of ASCII in Microsoft Office

If necessary, this type of text encoding can be used in Microsoft text editors such as Notepad and Office Word. However, some functions may not be available when typing in this case. For example, you will not be able to apply bold formatting, since ASCII encoding preserves only the content of the information and ignores its appearance and form.

Standardization

The ISO organization has adopted the ISO 8859 group of standards. This group defines eight-bit encodings for different language groups. Specifically, ISO 8859-1 is an Extended ASCII table for the United States and the countries of Western Europe, and ISO 8859-5 is a table used for the Cyrillic alphabet, including the Russian language.

For a number of historical reasons, the ISO 8859-5 standard was used for a very short time.

For the Russian language, the encodings actually in use at the moment are:

  • CP866 (Code Page 866) or DOS, which is often called alternative GOST encoding. It was actively used until the mid-90s of the last century. At the moment it is practically not used.
  • KOI-8. The encoding was developed in the 1970s and 80s, and is now the generally accepted standard for mail messages in Runet. It is widely used in operating systems of the Unix family, including Linux. The “Russian” version of KOI-8 is called KOI-8R. In addition, there are versions for other Cyrillic languages, for example Ukrainian.
  • Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to provide support for the Russian language in the Windows environment.

The main advantage of the first of these, CP866, was that it kept the pseudographic characters in the same positions as in Extended ASCII. This allowed foreign-made text programs, such as the famous Norton Commander, to run without changes. Currently, CP866 is used by Windows programs that run in full-screen text mode or in text windows, including FAR Manager.

Texts written in CP866 have become quite rare lately, but it is the encoding used for Russian file names in Windows.

"Unicode"

At the moment, Unicode is the most widely used character set. Unicode codes are divided into areas. The first (U+0000 to U+007F) contains the ASCII characters with the same codes. This is followed by the character areas of various national scripts, as well as punctuation marks and technical symbols. In addition, some Unicode codes are reserved in case there is a need to include new characters in the future.
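The overlap between ASCII and the first Unicode area can be checked directly. The sketch below assumes UTF-8 as the Unicode encoding form, which the paragraph above does not name explicitly.

```python
ascii_bytes = bytes(range(128))  # every 7-bit code, 0x00 to 0x7F

decoded_ascii = ascii_bytes.decode("ascii")
decoded_utf8 = ascii_bytes.decode("utf-8")

# The same bytes give the same characters, and each code point equals its ASCII code.
print(decoded_ascii == decoded_utf8)                       # True
print([ord(c) for c in decoded_utf8] == list(range(128)))  # True

# A character outside U+0000..U+007F needs more than one byte in UTF-8.
print(len("Я".encode("utf-8")))  # 2
```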

Now you know that in ASCII, each character is represented as a combination of 8 zeros and ones. To non-specialists, this information may seem unnecessary and uninteresting, but don’t you want to know what’s going on “in the brains” of your PC?!

8-bit encodings: ASCII, KOI-8R and CP1251

The first encoding tables created in the USA did not use the eighth bit in a byte. The text was represented as a sequence of bytes, but the eighth bit was not taken into account (it was used for service purposes).

The ASCII table (American Standard Code for Information Interchange) became the generally accepted standard. The first 32 characters of the ASCII table (00 to 1F) were used for non-printing characters. They were designed to control a printing device, etc. The rest, from 20 to 7F, are regular (printable) characters, apart from 7F (DEL).

Table 1 - ASCII encoding

Dec Hex Oct Char Description
0 0 000 NUL null
1 1 001 SOH start of heading
2 2 002 STX start of text
3 3 003 ETX end of text
4 4 004 EOT end of transmission
5 5 005 ENQ enquiry
6 6 006 ACK acknowledge
7 7 007 BEL bell
8 8 010 BS backspace
9 9 011 HT horizontal tab
10 A 012 LF new line
11 B 013 VT vertical tab
12 C 014 FF new page
13 D 015 CR carriage return
14 E 016 SO shift out
15 F 017 SI shift in
16 10 020 DLE data link escape
17 11 021 DC1 device control 1
18 12 022 DC2 device control 2
19 13 023 DC3 device control 3
20 14 024 DC4 device control 4
21 15 025 NAK negative acknowledge
22 16 026 SYN synchronous idle
23 17 027 ETB end of trans. block
24 18 030 CAN cancel
25 19 031 EM end of medium
26 1A 032 SUB substitute
27 1B 033 ESC escape
28 1C 034 FS file separator
29 1D 035 GS group separator
30 1E 036 RS record separator
31 1F 037 US unit separator
32 20 040 space
33 21 041 !
34 22 042 "
35 23 043 #
36 24 044 $
37 25 045 %
38 26 046 &
39 27 047 '
40 28 050 (
41 29 051 )
42 2A 052 *
43 2B 053 +
44 2C 054 ,
45 2D 055 -
46 2E 056 .
47 2F 057 /
48 30 060 0
49 31 061 1
50 32 062 2
51 33 063 3
52 34 064 4
53 35 065 5
54 36 066 6
55 37 067 7
56 38 070 8
57 39 071 9
58 3A 072 :
59 3B 073 ;
60 3C 074 <
61 3D 075 =
62 3E 076 >
63 3F 077 ?
64 40 100 @
65 41 101 A
66 42 102 B
67 43 103 C
68 44 104 D
69 45 105 E
70 46 106 F
71 47 107 G
72 48 110 H
73 49 111 I
74 4A 112 J
75 4B 113 K
76 4C 114 L
77 4D 115 M
78 4E 116 N
79 4F 117 O
80 50 120 P
81 51 121 Q
82 52 122 R
83 53 123 S
84 54 124 T
85 55 125 U
86 56 126 V
87 57 127 W
88 58 130 X
89 59 131 Y
90 5A 132 Z
91 5B 133 [
92 5C 134 \
93 5D 135 ]
94 5E 136 ^
95 5F 137 _
96 60 140 `
97 61 141 a
98 62 142 b
99 63 143 c
100 64 144 d
101 65 145 e
102 66 146 f
103 67 147 g
104 68 150 h
105 69 151 i
106 6A 152 j
107 6B 153 k
108 6C 154 l
109 6D 155 m
110 6E 156 n
111 6F 157 o
112 70 160 p
113 71 161 q
114 72 162 r
115 73 163 s
116 74 164 t
117 75 165 u
118 76 166 v
119 77 167 w
120 78 170 x
121 79 171 y
122 7A 172 z
123 7B 173 {
124 7C 174 |
125 7D 175 }
126 7E 176 ~
127 7F 177 DEL
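The printable rows of Table 1 can be regenerated from the code number alone. This is a minimal sketch; `ascii_row` is a name introduced here, and it deliberately covers only codes 33 to 126, since control characters and the space have no visible glyph.

```python
def ascii_row(code: int) -> str:
    """Format one table row as 'Dec Hex Oct Char' for a printable ASCII code."""
    if not 33 <= code <= 126:
        raise ValueError("only printable, non-space ASCII codes are supported")
    return f"{code} {code:X} {code:03o} {chr(code)}"


for code in (39, 65, 122):
    print(ascii_row(code))
# 39 27 047 '
# 65 41 101 A
# 122 7A 172 z
```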

As is easy to see, this encoding contains only letters, and only those used in English. There are also arithmetic and other service symbols. But there are neither Russian letters nor even the special Latin ones needed for German or French. This is easy to explain: the encoding was developed specifically as an American standard. As computers began to be used throughout the world, other characters needed to be encoded.

To do this, it was decided to use the eighth bit in each byte. This made 128 more values available (from 80 to FF) that could be used to encode characters. The first of the eight-bit tables, “extended ASCII” (Extended ASCII), included various variants of Latin characters used in some languages of Western Europe. It also contained other additional symbols, including pseudographics.

Pseudographic characters make it possible to produce some semblance of graphics while displaying only text characters. The file manager FAR Manager, for example, works using pseudographics.

There were no Russian letters in the Extended ASCII table. Russia (formerly the USSR) and other countries created their own encodings that made it possible to represent specific “national” characters in 8-bit text files: Latin letters of the Polish and Czech languages, Cyrillic (including Russian letters) and other alphabets.

In all encodings that have become widespread, the first 128 characters (that is, the byte values with the eighth bit equal to 0) are the same as in ASCII. So an ASCII file reads correctly in any of these encodings; English letters are represented identically in all of them.
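This compatibility can be demonstrated with Python's built-in codecs. The list of encodings below is just a sample chosen for this sketch, not an exhaustive one.

```python
ENCODINGS = ["ascii", "latin_1", "cp866", "koi8_r", "cp1251"]

data = b"Hello, ASCII!"  # every byte has the eighth bit equal to 0

decoded = {enc: data.decode(enc) for enc in ENCODINGS}
for enc, text in decoded.items():
    print(f"{enc:8} {text}")

# All code pages agree on the result.
print(len(set(decoded.values())) == 1)  # True
```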

The ISO organization (International Organization for Standardization) has adopted the ISO 8859 group of standards. It defines 8-bit encodings for different groups of languages. So, ISO 8859-1 is an Extended ASCII table for the USA and Western Europe. And ISO 8859-5 is a table for the Cyrillic alphabet (including Russian).

However, for historical reasons, the ISO 8859-5 encoding did not take root. In reality, the following encodings are used for the Russian language:

- Code Page 866 (CP866), aka “DOS”, aka “alternative GOST encoding”. Widely used until the mid-90s; now used to a limited extent. Practically not used for distributing texts on the Internet.
- KOI-8. Developed in the 70s and 80s. It is the generally accepted standard for transmitting mail messages in the Russian Internet. Also widely used in operating systems of the Unix family, including Linux. The Russian-language version of KOI-8 is called KOI-8R; there are versions for other Cyrillic languages (for example, KOI8-U is a version for the Ukrainian language).
- Code Page 1251, CP1251, Windows-1251. Developed by Microsoft to support the Russian language in Windows.

The main advantage of CP866 was that it kept the pseudographic characters in the same places as in Extended ASCII; therefore, foreign text programs, for example the famous Norton Commander, could work without changes. CP866 is now used for Windows programs running in text windows or full-screen text mode, including FAR Manager.

Texts in CP866 have been quite rare in recent years (but it is used to encode Russian file names in Windows). Therefore, we will dwell in more detail on two other encodings, KOI-8R and CP1251.



In the CP1251 encoding table, Russian letters are arranged in alphabetical order (with the exception, however, of the letter Ё). Thanks to this arrangement, it is very easy for computer programs to sort text alphabetically.

But in KOI-8R the order of Russian letters seems random. But actually it is not.

In many older programs, the 8th bit was lost when processing or transmitting text. (Such programs are now practically “extinct”, but in the late 80s and early 90s they were widespread.) To get the 7-bit value from an 8-bit value, just subtract 8 from the most significant hexadecimal digit; for example, E1 becomes 61.

Now compare KOI-8R with the ASCII table (Table 1). You will find that Russian letters are placed in clear correspondence with Latin ones. If the eighth bit disappears, lowercase Russian letters turn into uppercase Latin letters, and uppercase Russian letters turn into lowercase Latin letters. So, E1 in KOI-8 is the Russian “A”, while 61 in ASCII is the Latin “a”.

So, KOI-8 allows Russian text to remain readable when the 8th bit is lost. “Привет всем” (“Hello everyone”) becomes “pRIWET WSEM”.
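The loss of the eighth bit can be simulated by masking every byte with 0x7F. The sketch below uses Python's `koi8_r` codec and assumes that the Russian phrase behind this example is “Привет всем”; `strip_eighth_bit` is a hypothetical helper name.

```python
def strip_eighth_bit(text: str, encoding: str = "koi8_r") -> str:
    """Encode text, clear the top bit of every byte, and read the result as ASCII."""
    raw = text.encode(encoding)
    return bytes(b & 0x7F for b in raw).decode("ascii")


print(hex("А".encode("koi8_r")[0]))     # 0xe1, which becomes 0x61 ('a')
print(strip_eighth_bit("Привет всем"))  # pRIWET WSEM
```

The swapped case in the output follows from the table layout: uppercase Cyrillic letters sit 0x80 above lowercase Latin ones, and lowercase Cyrillic letters sit 0x80 above uppercase Latin ones.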

Recently, both the alphabetical order of characters in the encoding table and readability after loss of the 8th bit have lost their decisive importance. The eighth bit is not lost during transmission or processing on modern computers. And alphabetical sorting is done taking the encoding into account, not by simply comparing codes. (By the way, the CP1251 codes are not arranged completely alphabetically: the letter Ё is not in its place.)

Because two encodings are in common use, when working with the Internet (mail, browsing Web sites) you can sometimes see a meaningless set of letters instead of Russian text. For example, “Я СБЮФЕМХЕЛ”. These are just the words “с уважением” (“with respect”), but they were encoded in CP1251 and the computer decoded the text using the KOI-8 table. If, on the contrary, the same words were encoded in KOI-8 and the computer decoded the text according to the CP1251 table, the result would be “У ХЧБЦЕОЙЕН”.
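This kind of garbling is easy to reproduce and to undo. The sketch below assumes the original Russian phrase is “с уважением” (“with respect”) and uses Python's `cp1251` and `koi8_r` codecs; `misdecode` is a name made up for the illustration.

```python
def misdecode(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


original = "с уважением"

garbled = misdecode(original, "cp1251", "koi8_r")
print(garbled)                                # Я СБЮФЕМХЕЛ
print(misdecode(original, "koi8_r", "cp1251"))  # У ХЧБЦЕОЙЕН

# Reversing the two steps restores the text, as long as no bytes were lost.
print(misdecode(garbled, "koi8_r", "cp1251"))   # с уважением
```

Choosing the encoding manually in a mail program or browser amounts to the last step: the same bytes are decoded again with the correct table.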

Sometimes a computer decodes Russian-language text using a table not intended for the Russian language. Then, instead of Russian letters, a meaningless set of symbols appears (for example, Latin letters of Eastern European languages); these are often called “krokozyabry”.

In most cases, modern programs determine the encoding of Internet documents (emails and Web pages) on their own. But sometimes they “misfire”, and then you can see strange sequences of Russian letters or “krokozyabry”. As a rule, in such a situation it is enough to select the encoding manually in the program menu to display the real text on the screen.

Information from the page http://open-office.edusite.ru/TextProcessor/p5aa1.html was used for this article.


As you know, a computer stores information in binary form, representing it as a sequence of ones and zeros. To translate information into a form convenient for human perception, each unique sequence of numbers is replaced by its corresponding symbol when displayed.

One of the systems for correlating binary codes with printed and control characters is ASCII.

At today's level of development of computer technology, the user is not required to know the code of each specific character. However, a general understanding of how encoding is carried out is extremely useful, and for some categories of specialists even necessary.

Creating ASCII

The encoding was originally developed in 1963 and then updated twice over the course of 25 years.

In the original version, the ASCII character table included 128 characters; later an extended version appeared, where the first 128 characters were saved, and previously missing characters were assigned to codes with the eighth bit involved.

For many years this encoding was the most popular in the world. In 2006, Latin 1252 took the leading position, and from the end of 2007 to the present, Unicode has firmly held the leading position.

Computer representation of ASCII

Each ASCII character has its own code, consisting of 8 characters representing a zero or a one. The minimum number in this representation is zero (eight zeros in the binary system), which is the code of the first element in the table.

Two codes in the table were reserved for switching between standard US-ASCII and its national variant.

After ASCII began to include not 128, but 256 characters, an encoding variant became widespread, in which the original version of the table was stored in the first 128 codes with the 8th bit zero. National written characters were stored in the upper half of the table (positions 128-255).

The user does not need to know the ASCII character codes directly. For a software developer, it is usually enough to know the number of the element in the table in order, if necessary, to calculate its code in the binary system.

Russian language

After encodings for the Scandinavian languages, Chinese, Korean, Greek, etc. were developed in the early 70s, the Soviet Union also began creating its own version. Soon, a version of an 8-bit encoding called KOI8 was developed, preserving the first 128 ASCII character codes and allocating the same number of positions for letters of the national alphabet and additional characters.

Before the introduction of Unicode, KOI8 dominated the Russian segment of the Internet. There were encoding options for both the Russian and Ukrainian alphabet.

ASCII problems

Since the number of elements even in the extended table did not exceed 256, there was no possibility of accommodating several different scripts in one encoding. In the 90s, the “krokozyabry” problem appeared on the Runet, when texts typed in Russian characters of an extended ASCII table were displayed incorrectly.

The problem was that the codes in the various extended ASCII variants did not match one another. Recall that different characters could be located in positions 128-255, so when text was read in a different Cyrillic encoding from the one it was written in, every letter was replaced with the one that had the same number in the other encoding.
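The mismatch is visible even for a single byte. The sketch below reads the byte 0xE1 with three Cyrillic code pages available in Python; the choice of byte and the helper name `read_byte` are arbitrary and made for this example.

```python
def read_byte(value: int, encoding: str) -> str:
    """Interpret one byte value according to the given single-byte encoding."""
    return bytes([value]).decode(encoding)


for enc in ("cp866", "koi8_r", "cp1251"):
    print(enc, read_byte(0xE1, enc))
# cp866 с
# koi8_r А
# cp1251 б

# Below 128 every table agrees.
print(read_byte(0x61, "cp866"), read_byte(0x61, "koi8_r"), read_byte(0x61, "cp1251"))  # a a a
```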

Current state

With the advent of Unicode, the popularity of ASCII began to decline sharply.

The reason for this lies in the fact that the new encoding made it possible to accommodate characters from almost all written languages. In this case, the first 128 ASCII characters correspond to the same characters in Unicode.

In 2000, ASCII was the most popular encoding on the Internet and was used on 60% of web pages indexed by Google. By 2012, the share of such pages had dropped to 17%, and Unicode (UTF-8) took the place of the most popular encoding.

So ASCII is an important part of the history of information technology; however, its use in the future seems unpromising.

In order to use ASCII correctly, it is necessary to expand your knowledge in this area and about coding capabilities.

What is it?

ASCII is an encoding table for the printable characters typed on a computer keyboard (see screenshot No. 1) and for certain control codes, used to transmit information. In other words, the letters of the alphabet and the decimal digits are encoded as the corresponding codes that represent and carry the necessary information.

ASCII was developed in America, so the standard character set contains the English alphabet and the digits, for a total of 128 characters. But then a fair question arises: what to do if encoding of a national alphabet is required?

Other versions of the ASCII table have been developed to address such issues. For example, for languages with a different alphabet, the letters of the English alphabet were either removed, or additional characters of the national alphabet were added to them. Thus, an ASCII-based encoding may contain Russian letters for national use (see screenshot No. 2).

Where is the ASCII coding system used?

This coding system is needed not only for typing text on the keyboard. It is also used in graphics. For example, in the ASCII Art Maker program, graphic images in various formats are made up of a range of ASCII characters (see screenshot No. 3).

Usually, such programs can be divided into those that act as graphic editors, converting an image into text, and those that convert an image into ASCII graphics. The well-known emoticon (or, as it is also called, the “smiling human face”) is also an example of a picture made from encoded characters.

This encoding method can also be used when writing or creating an HTML document. For example, you enter a specific set of characters, and when the page is viewed, the symbol corresponding to this code is displayed on the screen.

Among other things, this type of encoding is needed when creating a multilingual website, because characters that are not included in a particular national table have to be replaced with ASCII codes. If the reader is directly involved in information and communication technologies (ICT), it will be useful to become familiar with such topics as:

  • Portable character set;
  • Control characters;
  • EBCDIC;
  • VISCII;
  • YUSCII;
  • Unicode;
  • ASCII art;
  • KOI-8.

    ASCII Table Properties

    Like any systematically constructed table, ASCII has its own characteristic properties. For example, the decimal digits (0 to 9) are encoded so that the low four bits of each code are the digit's value in the binary number system; the digit 8, for instance, has the code 0111000, whose low four bits are 1000.
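One well-known property of this kind is that the codes of the digits 0 to 9 carry the digit's binary value in their low four bits. The sketch below illustrates that property; `digit_value` is a name invented for the example.

```python
def digit_value(ch: str) -> int:
    """Recover the numeric value of an ASCII digit from the low four bits of its code."""
    if not ("0" <= ch <= "9"):
        raise ValueError("expected a single ASCII digit")
    return ord(ch) & 0x0F


print(format(ord("8"), "07b"))  # 0111000
print(digit_value("8"))         # 8
print([digit_value(c) for c in "2024"])  # [2, 0, 2, 4]
```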

    An uppercase letter and its lowercase counterpart, located in the upper and lower columns of the table, differ from each other by only one bit, which greatly reduces the complexity of checking and changing case.

    With all these properties, ASCII encoding works as eight-bit, although it was originally intended to be seven-bit.

    Use of ASCII in Microsoft Office programs:

    If necessary, this method of encoding information can be used in Microsoft Notepad and Microsoft Office Word. Within these applications, the document can be saved in ASCII format, but in this case, you will not be able to use some functions when typing text.

    In particular, bold and other character formatting will not be available, because the encoding preserves only the content of the typed information, not its appearance and form. You can add such codes to a document using the following software applications:

    • Microsoft Excel;
    • Microsoft FrontPage;
    • Microsoft InfoPath;
    • Microsoft OneNote;
    • Microsoft Outlook;
    • Microsoft PowerPoint;
    • Microsoft Project.

    Note that when typing an ASCII code in these applications, you must hold down the ALT key on the keyboard.

    Of course, all the necessary codes require a longer and more detailed study, but this is beyond the scope of our article today. I hope that you found it really useful.

    See you again!


    In this article: Insert an ASCII or Unicode character into a document

    If you only need to enter a few special characters or symbols, you can use keyboard shortcuts. For a list of ASCII characters, see the following tables or the article Inserting National Alphabets Using Keyboard Shortcuts.


    Inserting ASCII characters

    To insert an ASCII character, press and hold the ALT key while entering the character code. For example, to insert a degree symbol (°), press and hold the ALT key, then type 0176 on the numeric keypad.

    To enter numbers, use the numeric keypad rather than the numbers on the main keyboard. If you need to enter numbers on the numeric keypad, make sure the NUM LOCK indicator is on.
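Code 176 lies outside the 7-bit ASCII range, so the character it produces depends on the code page in use. The sketch below assumes that ALT codes typed with a leading zero are looked up in the Windows-1252 code page, which is the usual case on Western European and US systems; `alt_code_char` is a hypothetical helper name.

```python
def alt_code_char(code: int, code_page: str = "cp1252") -> str:
    """Return the character that a one-byte code selects in the given code page."""
    return bytes([code]).decode(code_page)


print(alt_code_char(176))  # °
print(alt_code_char(65))   # A
```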

    Inserting Unicode Characters

    To insert a Unicode character, enter the character code, then press the ALT and X keys. For example, to insert a dollar symbol ($), enter 0024 and then press ALT and X. For all Unicode character codes, see .

    Important: Some Microsoft Office programs, such as PowerPoint and InfoPath, do not support converting Unicode codes to characters. If you need to insert a Unicode character in one of these programs, use .

    Notes:

      If the wrong Unicode character appears after you press ALT+X, select the correct code, and then press ALT+X again.

      In addition, when the code directly follows other characters, enter "U+" before it. For example, if you enter "1U+B5" and press ALT+X, the text "1µ" will be displayed, and if you enter "1B5" and press ALT+X, the symbol "Ƶ" will be displayed.
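The codes typed before ALT+X are hexadecimal Unicode code points. The following sketch converts such a code into its character in Python; it does not reproduce how Word itself parses the text, and `from_hex_code` is a name introduced for this example.

```python
def from_hex_code(hex_code: str) -> str:
    """Convert a hexadecimal code point such as '0024' or 'U+B5' to its character."""
    if hex_code.upper().startswith("U+"):
        hex_code = hex_code[2:]
    return chr(int(hex_code, 16))


print(from_hex_code("0024"))  # $
print(from_hex_code("U+B5"))  # µ
print(from_hex_code("1B5"))   # Ƶ
```

The last two lines show why the "U+" prefix matters: "B5" and "1B5" are different code points, so a digit standing directly before the code changes the result.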

    Using Character Map

    Character Map is a program built into Microsoft Windows that lets you view the characters available in the selected font.

    Using Character Map, you can copy individual characters or a group of characters to the clipboard and paste them into any program that supports displaying those characters.

    Opening Character Map

      In Windows 10, enter the word "character" in the search box on the taskbar and select Character Map from the search results.

      In Windows 8, enter the word "character" on the Start screen and select Character Map from the search results.

      In Windows 7, click the Start button, select All Programs, Accessories, System Tools, and then click Character Map.

    Characters are grouped by font. Click the font list to select the appropriate character set. To select a character, click it, then click the Select button. To insert the character, right-click the desired location in the document and select Paste.

    Frequently used character codes

    For a full list of characters, see Character Map on your computer, the ASCII character code table, or the Unicode character tables organized by set.

    (Tables of frequently used glyphs and their codes: currency, legal symbols, mathematical symbols, fractions, punctuation and dialect symbols, shape symbols.)

    Commonly used diacritics codes

    For a complete list of glyphs and corresponding codes, see.


    Non-printing ASCII control characters

    The characters used to control some peripheral devices, such as printers, are numbered 0–31 in the ASCII table. For example, the page feed/new page character is number 12. This character tells the printer to move to the beginning of the next page.

    Table of non-printing ASCII control characters

    Decimal number  Sign
    0  Null
    1  Start of heading
    2  Start of text
    3  End of text
    4  End of transmission
    5  Enquiry
    6  Acknowledge
    7  Bell (sound signal)
    8  Backspace
    9  Horizontal tab
    10  Line feed/new line
    11  Vertical tab
    12  Form feed/new page
    13  Carriage return
    14  Shift out
    15  Shift in
    16  Data link escape
    17  Device control 1
    18  Device control 2
    19  Device control 3
    20  Device control 4
    21  Negative acknowledge
    22  Synchronous idle
    23  End of transmission block
    24  Cancel
    25  End of medium
    26  Substitute
    27  Escape
    28  File separator
    29  Group separator
    30  Record separator
    31  Unit separator





    

    2024 gtavrl.ru.