Table of symbols in binary code. ASCII (American Standard Code for Information Interchange): the basic text encoding for the Latin alphabet


A computer can work with text only after it has been converted into a form convenient for transmission, storage, and automatic processing. Various code tables are used for this purpose. ASCII, the first such system, was developed in the United States for working with English text and subsequently spread throughout the world. The article below describes its history, features, properties, and further use.

Display and storage of information in a computer

Symbols on the screen of a computer monitor or mobile digital gadget are formed from a set of vector glyph shapes plus a code that selects, from among them, the symbol to be inserted in the right place. That code is a sequence of bits: each character must correspond uniquely to a set of zeros and ones appearing in a specific, unique order.

How it all began

Historically, the first computers were English-language. To encode character information in them, it was enough to use only 7 bits of memory, although 1 byte consisting of 8 bits was allocated for the purpose. The number of characters the computer understood was therefore 128: the English alphabet, its punctuation marks, digits, and some special characters. The seven-bit English-language encoding and its table (code page), developed in 1963, were named the American Standard Code for Information Interchange; the abbreviation "ASCII" was and still is used to denote it.

Transition to multilingualism

Over time, computers came into wide use in non-English-speaking countries, creating a need for encodings that could represent national languages. Rather than reinvent the wheel, developers took ASCII as the basis. The table expanded significantly in the new edition: using the 8th bit made it possible to encode 256 characters.

Description

The ASCII table is divided into 2 parts. Only its first half is considered a generally accepted international standard. It includes:

  • Characters with serial numbers from 0 to 31, encoded by the sequences 00000000 to 00011111. They are reserved for control characters, which govern the process of displaying text on a screen or printer, sounding an audible signal, and so on.
  • Characters with numbers from 32 to 127, encoded by the sequences 00100000 to 01111111, which form the standard part of the table. These include the space (No. 32), the letters of the Latin alphabet (lowercase and uppercase), the ten digits 0 to 9, punctuation marks, brackets of various styles, and other symbols.
  • Characters with serial numbers from 128 to 255, encoded by the sequences 10000000 to 11111111. These include letters of national alphabets other than Latin. It is this alternative part of the ASCII table that is used to represent Russian characters in computer form.

Some properties

A notable feature of ASCII is that the lowercase and uppercase forms of the letters "A" to "Z" differ by only one bit. This greatly simplifies case conversion, as well as checking whether a character belongs to a given range of values. Moreover, every letter is represented by its own sequence number in the alphabet, written with 5 binary digits and preceded by 011₂ for lowercase letters and 010₂ for uppercase letters.
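This bit-level structure is easy to verify. The following Python snippet (an illustration added here, not part of the standard itself; the helper name is_ascii_letter is invented for the example) checks the one-bit case difference and sketches the resulting range check:

```python
# Upper- and lowercase ASCII letters differ only in bit 5 (0x20):
# 'A' = 0100 0001 (0x41), 'a' = 0110 0001 (0x61).
for upper, lower in [("A", "a"), ("Z", "z")]:
    assert ord(upper) ^ 0x20 == ord(lower)  # flipping bit 5 toggles case

# A range check becomes a single comparison after clearing bit 5.
def is_ascii_letter(ch: str) -> bool:
    return "A" <= chr(ord(ch) & ~0x20) <= "Z"

print(is_ascii_letter("q"), is_ascii_letter("7"))  # True False
```

This is exactly why C's tolower/toupper and similar routines can be implemented with a single bit operation for ASCII input.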

Another feature of ASCII is the representation of the 10 digits "0" to "9". In binary they all begin with 0011₂, followed by the digit's value. Thus 0101₂ is equivalent to the decimal number five, so the symbol "5" is written as 0011 0101₂. Based on this, you can easily convert a binary-coded decimal number to an ASCII string by prepending the bit sequence 0011₂ to each nibble.
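This nibble arithmetic can be checked directly in Python (a small illustrative sketch; the helper name bcd_nibble_to_ascii is chosen here for the example):

```python
# Each ASCII digit is 0011 in the high nibble plus the digit's value
# in the low nibble: '5' = 0011 0101 = 0x35.
for value in range(10):
    assert ord(str(value)) == 0x30 | value

# Hence a binary-coded-decimal nibble becomes an ASCII digit by
# prepending the bits 0011 (i.e. OR-ing with 0x30):
def bcd_nibble_to_ascii(nibble: int) -> str:
    return chr(0x30 | nibble)

print(bcd_nibble_to_ascii(5))  # 5
```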

"Unicode"

As you know, thousands of characters are required to display texts in the languages of the Southeast Asian group. Such a number cannot be described in a single byte of information, so even extended versions of ASCII could no longer satisfy the increased needs of users from different countries.

Thus the need arose for a universal text encoding, whose development was undertaken by the Unicode consortium in collaboration with many leaders of the global IT industry. Its specialists first created the UTF-32 system, in which 32 bits, i.e. 4 bytes of information, were allocated to encode 1 character. The main disadvantage was a sharp, fourfold increase in the amount of memory required, which entailed many problems.

At the same time, for most countries whose official languages belong to the Indo-European group, a character count of 2³² is more than excessive.

Further work by the Unicode consortium's specialists produced the UTF-16 encoding, an option that suited everyone in terms of both the memory required and the number of encodable characters. It reserves 2 bytes per character and was widely adopted as a default.

Even this fairly advanced and successful version of Unicode had drawbacks: after switching from extended ASCII to UTF-16, the size of a document doubled.

For this reason, the variable-length encoding UTF-8 came into use. In it, each character of the source text is encoded as a sequence of 1 to 6 bytes.

Connection with the American Standard Code for Information Interchange

In variable-length UTF-8, all Latin characters are encoded into 1 byte, exactly as in the ASCII encoding system.

A special feature of UTF-8 is that if a text uses only Latin letters and no other characters, even programs that do not understand Unicode can still read it. In other words, the basic ASCII part of the encoding simply became part of the new variable-length UTF. Cyrillic characters occupy 2 bytes in UTF-8, and Georgian characters, for example, 3 bytes. The creation of UTF-16 and UTF-8 solved the main problem of establishing a single code space in fonts; since then, font makers need only fill the table with vector glyph shapes according to their needs.
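These byte counts are easy to confirm in any Unicode-aware language; here is a short Python check (an added illustration, not from the original article):

```python
# UTF-8 is backward compatible with ASCII: Latin letters occupy 1 byte,
# Cyrillic letters 2 bytes, Georgian letters 3 bytes.
samples = {"A": 1, "б": 2, "ბ": 3}  # Latin, Cyrillic, Georgian
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected

# Pure ASCII text encodes to the very same bytes in UTF-8:
assert "ASCII".encode("utf-8") == "ASCII".encode("ascii")
print("all byte counts match")
```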

Different operating systems prefer different encodings. To read and edit texts typed in another encoding, Russian text conversion programs are used; some text editors contain built-in transcoders and can read text regardless of its encoding.

Now you know how many characters are in the ASCII encoding and how and why it was developed. Of course, today the Unicode standard is most widespread in the world. However, we must not forget that it is based on ASCII, so the contribution of its developers to the IT field should be appreciated.

Dec Hex Symbol          Dec Hex Symbol
000 00  NUL             128 80  Ђ
001 01  SOH             129 81  Ѓ
002 02  STX             130 82  ‚
003 03  ETX             131 83  ѓ
004 04  EOT             132 84  „
005 05  ENQ             133 85  …
006 06  ACK             134 86  †
007 07  BEL             135 87  ‡
008 08  BS              136 88  €
009 09  TAB             137 89  ‰
010 0A  LF              138 8A  Љ
011 0B  VT              139 8B  ‹
012 0C  FF              140 8C  Њ
013 0D  CR              141 8D  Ќ
014 0E  SO              142 8E  Ћ
015 0F  SI              143 8F  Џ
016 10  DLE             144 90  ђ
017 11  DC1             145 91  ‘
018 12  DC2             146 92  ’
019 13  DC3             147 93  “
020 14  DC4             148 94  ”
021 15  NAK             149 95  •
022 16  SYN             150 96  –
023 17  ETB             151 97  —
024 18  CAN             152 98  (not used)
025 19  EM              153 99  ™
026 1A  SUB             154 9A  љ
027 1B  ESC             155 9B  ›
028 1C  FS              156 9C  њ
029 1D  GS              157 9D  ќ
030 1E  RS              158 9E  ћ
031 1F  US              159 9F  џ
032 20  SP (space)      160 A0  (no-break space)
033 21  !               161 A1  Ў
034 22  "               162 A2  ў
035 23  #               163 A3  Ј
036 24  $               164 A4  ¤
037 25  %               165 A5  Ґ
038 26  &               166 A6  ¦
039 27  '               167 A7  §
040 28  (               168 A8  Ё
041 29  )               169 A9  ©
042 2A  *               170 AA  Є
043 2B  +               171 AB  «
044 2C  ,               172 AC  ¬
045 2D  -               173 AD  (soft hyphen)
046 2E  .               174 AE  ®
047 2F  /               175 AF  Ї
048 30  0               176 B0  °
049 31  1               177 B1  ±
050 32  2               178 B2  І
051 33  3               179 B3  і
052 34  4               180 B4  ґ
053 35  5               181 B5  µ
054 36  6               182 B6  ¶
055 37  7               183 B7  ·
056 38  8               184 B8  ё
057 39  9               185 B9  №
058 3A  :               186 BA  є
059 3B  ;               187 BB  »
060 3C  <               188 BC  ј
061 3D  =               189 BD  Ѕ
062 3E  >               190 BE  ѕ
063 3F  ?               191 BF  ї
064 40  @               192 C0  А
065 41  A               193 C1  Б
066 42  B               194 C2  В
067 43  C               195 C3  Г
068 44  D               196 C4  Д
069 45  E               197 C5  Е
070 46  F               198 C6  Ж
071 47  G               199 C7  З
072 48  H               200 C8  И
073 49  I               201 C9  Й
074 4A  J               202 CA  К
075 4B  K               203 CB  Л
076 4C  L               204 CC  М
077 4D  M               205 CD  Н
078 4E  N               206 CE  О
079 4F  O               207 CF  П
080 50  P               208 D0  Р
081 51  Q               209 D1  С
082 52  R               210 D2  Т
083 53  S               211 D3  У
084 54  T               212 D4  Ф
085 55  U               213 D5  Х
086 56  V               214 D6  Ц
087 57  W               215 D7  Ч
088 58  X               216 D8  Ш
089 59  Y               217 D9  Щ
090 5A  Z               218 DA  Ъ
091 5B  [               219 DB  Ы
092 5C  \               220 DC  Ь
093 5D  ]               221 DD  Э
094 5E  ^               222 DE  Ю
095 5F  _               223 DF  Я
096 60  `               224 E0  а
097 61  a               225 E1  б
098 62  b               226 E2  в
099 63  c               227 E3  г
100 64  d               228 E4  д
101 65  e               229 E5  е
102 66  f               230 E6  ж
103 67  g               231 E7  з
104 68  h               232 E8  и
105 69  i               233 E9  й
106 6A  j               234 EA  к
107 6B  k               235 EB  л
108 6C  l               236 EC  м
109 6D  m               237 ED  н
110 6E  n               238 EE  о
111 6F  o               239 EF  п
112 70  p               240 F0  р
113 71  q               241 F1  с
114 72  r               242 F2  т
115 73  s               243 F3  у
116 74  t               244 F4  ф
117 75  u               245 F5  х
118 76  v               246 F6  ц
119 77  w               247 F7  ч
120 78  x               248 F8  ш
121 79  y               249 F9  щ
122 7A  z               250 FA  ъ
123 7B  {               251 FB  ы
124 7C  |               252 FC  ь
125 7D  }               253 FD  э
126 7E  ~               254 FE  ю
127 7F  DEL             255 FF  я

(Codes 000-031 and 127 are control characters; codes 128-255 are shown here in their Windows-1251 variant.)

Character code table: ASCII with the Windows-1251 code page.
Description of special (control) characters

It should be noted that the control characters of the ASCII table were originally used for data exchange via teletypewriter, data entry from punched tape, and simple control of external devices.
Today most of the ASCII control characters no longer carry this load and can be used for other purposes.
Code     Description
NUL, 00  Null, empty
SOH, 01  Start Of Heading
STX, 02  Start of TeXt, beginning of the text
ETX, 03  End of TeXt, end of the text
EOT, 04  End Of Transmission
ENQ, 05  Enquiry, please confirm
ACK, 06  Acknowledgment, I confirm
BEL, 07  Bell, ring
BS, 08   Backspace, go back one character
TAB, 09  Tab, horizontal tabulation
LF, 0A   Line Feed. In most programming languages it is written as \n
VT, 0B   Vertical Tab, vertical tabulation
FF, 0C   Form Feed, new page
CR, 0D   Carriage Return. In most programming languages it is written as \r
SO, 0E   Shift Out, change the color of the ink ribbon in the printing device
SI, 0F   Shift In, change the ink ribbon color back
DLE, 10  Data Link Escape, switch the channel to data transmission
DC1, 11  Device Control 1
DC2, 12  Device Control 2
DC3, 13  Device Control 3
DC4, 14  Device Control 4 (device control characters)
NAK, 15  Negative Acknowledgment, I do not confirm
SYN, 16  Synchronization character
ETB, 17  End of Transmission Block
CAN, 18  Cancel, cancellation of a previously transmitted block
EM, 19   End of Medium
SUB, 1A  Substitute. Placed in place of a character whose value was lost or corrupted during transmission
ESC, 1B  Escape, start of a control sequence
FS, 1C   File Separator
GS, 1D   Group Separator
RS, 1E   Record Separator
US, 1F   Unit Separator
DEL, 7F  Delete, erase the last character
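A few of these control codes are still met daily as escape sequences; the following Python lines (added for illustration) confirm the code values listed above:

```python
# LF (0x0A) and CR (0x0D) survive in modern languages as \n and \r;
# TAB (0x09) and NUL (0x00) likewise have escape forms.
assert ord("\n") == 0x0A
assert ord("\r") == 0x0D
assert ord("\t") == 0x09
assert ord("\0") == 0x00
print(repr("line1\nline2"))  # the LF byte shown as the \n escape
```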

According to the International Telecommunication Union, in 2016 three and a half billion people used the Internet with some regularity. Most of them never stop to think that any messages they send from a PC or mobile gadget, like the texts displayed on all kinds of monitors, are actually combinations of 0s and 1s. This representation of information is called encoding; it ensures, and greatly facilitates, storage, processing, and transmission. In 1963 the American ASCII encoding was developed, and it is the subject of this article.

Presenting information on a computer

From the point of view of any electronic computer, text is a set of individual characters: not only letters, both lowercase and capital, but also punctuation marks and digits. In addition, special characters such as “=”, “&”, “(” and the space are used.

The set of characters that make up a text is called the alphabet, and their number is its cardinality (denoted N). It is determined by the expression N = 2^b, where b is the number of bits, i.e. the information weight of a single symbol.

It has been shown that an alphabet with a cardinality of 256 characters can represent all the necessary symbols.

Since 256 is the 8th power of two, the weight of each character is 8 bits.

A unit of 8 bits is called a byte, so it is customary to say that any character of text stored on a computer occupies one byte of memory.
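The relation N = 2^b can be turned into a tiny calculator. The Python sketch below (an added illustration; the function name bits_per_symbol is invented here) recovers the bit weight from the cardinality:

```python
import math

# N = 2**b links the alphabet cardinality N to the information
# weight b (in bits) of one symbol.
def bits_per_symbol(cardinality: int) -> int:
    return math.ceil(math.log2(cardinality))

assert bits_per_symbol(128) == 7   # original 7-bit ASCII
assert bits_per_symbol(256) == 8   # one byte per character
print(bits_per_symbol(65536))      # 16, as in a 2-byte-per-character code
```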

How is coding done?

Texts are entered into the memory of a personal computer from keyboard keys bearing digits, letters, punctuation marks, and other symbols. In RAM they are stored in binary code: each character is associated with a decimal code familiar to humans, from 0 to 255, which corresponds to a binary code from 00000000 to 11111111.

Byte-by-byte character encoding allows the processor performing text processing to access each character individually. At the same time, 256 characters are quite enough to represent any symbolic information.
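The character-to-code correspondence is directly observable in Python via ord and chr (a small added illustration):

```python
# Each character maps to a code in 0..255 and back again; storing one
# code per byte lets the processor address every character separately.
text = "ASCII"
codes = [ord(ch) for ch in text]
assert codes == [65, 83, 67, 73, 73]
assert all(0 <= c <= 255 for c in codes)
assert "".join(chr(c) for c in codes) == text  # decoding restores the text
print(codes)
```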

ASCII character encoding

This English abbreviation stands for American Standard Code for Information Interchange.

Even at the dawn of computerization, it became obvious that it was possible to come up with a wide variety of ways to encode information. However, to transfer information from one computer to another, it was necessary to develop a unified standard. So, in 1963, the ASCII encoding table appeared in the USA. In it, any symbol of the computer alphabet is associated with its serial number in binary representation. ASCII was originally used only in the United States and later became an international standard for PCs.

ASCII codes are divided into 2 parts. Only the first half of this table is considered the international standard. It includes characters with serial numbers from 0 (coded as 00000000) to 127 (coded 01111111).

Serial number    ASCII text encoding        Symbol
0 - 31           0000 0000 - 0001 1111      Control characters. Their function is to "manage" the process of displaying text on a monitor or printing device, sounding an audible signal, etc.
32 - 127         0010 0000 - 0111 1111      Standard part of the table: upper- and lowercase Latin letters, the ten digits, punctuation marks, various brackets, commercial and other symbols. Character 32 is the space.
128 - 255        1000 0000 - 1111 1111      Alternative part of the table, or code page. It can have different variants, each with its own number. The code page specifies national alphabets different from the Latin one; in particular, it is through the code page that ASCII encoding of Russian characters is carried out.

In the table, letters (uppercase and lowercase) follow each other in alphabetical order, and digits are arranged in ascending order. The same principle holds for the Russian alphabet.

Control characters

The ASCII encoding table was originally created for receiving and transmitting information via the teletype, a device long out of use. Non-printable characters were therefore included in the character set, serving as commands to control that device, much like the commands used in pre-computer messaging methods such as Morse code.

The most familiar of these characters is NUL (00). It is still used today in many programming languages, such as C, to indicate the end of a string.

Where is ASCII encoding used?

The American standard code is needed not only for entering text information from the keyboard. It is also used in graphics: in ASCII-art programs, images of various formats are composed from a palette of ASCII characters.

There are two types of such products: programs that act as graphic editors, converting images into text, and programs that convert "drawings" into ASCII graphics. The famous emoticon is a prime example of a symbol built from encoding characters.

ASCII can also be used when creating an HTML document: you enter a certain character code, and when the page is viewed, the symbol corresponding to that code appears on the screen.

ASCII is also necessary for creating multilingual websites, since characters that are not included in a specific national table are replaced with ASCII codes.

Some features

ASCII originally encoded text using 7 bits (the eighth was left unused), but today it works with 8.

Letters located in corresponding upper- and lowercase positions differ from each other by only a single bit, which significantly reduces the complexity of checks.

Using ASCII in Microsoft Office

If necessary, this type of text encoding can be used in Microsoft text editors such as Notepad and Office Word. However, some functions become unavailable when typing this way: you cannot make text bold, for example, because ASCII encoding preserves only the meaning of the information, ignoring its appearance and form.

Standardization

The ISO organization has adopted the ISO 8859 group of standards, which defines eight-bit encodings for different language groups. Specifically, ISO 8859-1 is Extended ASCII, a table for the United States and the countries of Western Europe, while ISO 8859-5 is a table for the Cyrillic alphabet, including Russian.

For a number of historical reasons, the ISO 8859-5 standard was used for a very short time.

For the Russian language, the encodings actually in use at the moment are:

  • CP866 (Code Page 866), or DOS, often called the alternative GOST encoding. It was actively used until the mid-1990s; today it is hardly used at all.
  • KOI-8. Developed in the 1970s and 80s, it is now the generally accepted standard for mail messages in the Russian internet and is widely used in Unix-like operating systems, including Linux. The "Russian" version is called KOI-8R; versions also exist for other Cyrillic languages, such as Ukrainian.
  • Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to support the Russian language in the Windows environment.
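The practical difference between these tables becomes visible when the same Cyrillic letter is encoded with each of them. A short Python check (an added illustration; cp866, koi8-r, cp1251, and iso8859-5 are the codec names Python's standard library uses for these encodings):

```python
# The same letter Ж receives a different byte in each legacy encoding.
letter = "Ж"
for codec in ("cp866", "koi8-r", "cp1251", "iso8859-5"):
    print(codec, letter.encode(codec).hex())

# Only the 7-bit ASCII half is shared: 'Z' encodes identically everywhere.
assert all("Z".encode(c) == b"Z" for c in ("cp866", "koi8-r", "cp1251"))
```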

The main advantage of the first standard, CP866, was the preservation of pseudographic characters in the same positions as in Extended ASCII, which allowed foreign-made text-mode programs, such as the famous Norton Commander, to run without changes. CP866 is still used for Windows programs that run in full-screen text mode or in text windows, including FAR Manager.

Texts in the CP866 encoding have lately become quite rare, but it is the encoding used for Russian file names in Windows.

"Unicode"

At the moment, this encoding is the most widely used. Unicode codes are divided into areas. The first (U+0000 to U+007F) contains the ASCII characters with their original codes. Then follow the areas for the characters of various national scripts, punctuation marks, and technical symbols. Some Unicode codes are also held in reserve in case new characters need to be included in the future.
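The backward compatibility of the first Unicode area with ASCII can be demonstrated in a few lines of Python (an added illustration):

```python
# The first Unicode area (U+0000..U+007F) reproduces ASCII exactly,
# so ASCII codes and Unicode code points coincide for Latin text.
for ch in "Az9?":
    assert ord(ch) < 0x80                      # inside the ASCII area
assert ord("A") == "A".encode("ascii")[0] == 0x41
print(hex(ord("€")))  # 0x20ac: a character from a later area
```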

Now you know that in ASCII, each character is represented as a combination of 8 zeros and ones. To non-specialists, this information may seem unnecessary and uninteresting, but don’t you want to know what’s going on “in the brains” of your PC?!

Let's remember some facts we know:

The set of symbols with which text is written is called alphabet.

The number of characters in an alphabet is its cardinality.

Formula for determining the amount of information: N = 2^b,

where N is the power of the alphabet (number of characters),

b - number of bits (information weight of the symbol).

The alphabet with a capacity of 256 characters can accommodate almost all the necessary characters. Such an alphabet is called sufficient.

Because 256 = 2^8, the weight of 1 character is 8 bits.

The unit of measurement 8 bits was given the name 1 byte:

1 byte = 8 bits.

The binary code of each character in computer text takes up 1 byte of memory.

How is text information represented in computer memory?

Coding consists of assigning each character a unique decimal code from 0 to 255 or a corresponding binary code from 00000000 to 11111111. Thus, a person distinguishes characters by their outline, and a computer by their code.

The convenience of byte-by-byte character encoding is obvious because a byte is the smallest addressable part of memory and, therefore, the processor can access each character separately when processing text. On the other hand, 256 characters is quite a sufficient number to represent a wide variety of symbolic information.

Now the question arises, which eight-bit binary code to assign to each character.

It is clear that this is a conditional matter; you can come up with many encoding methods.

The ASCII table (read "as-kee", American Standard Code for Information Interchange) has become the international standard for PCs.

Only the first half of the table is the international standard, i.e. characters with numbers from 0 (00000000) to 127 (01111111).

Serial number    Binary code             Symbol
0 - 31           00000000 - 00011111     Control characters. Their function is to control the process of displaying text on the screen or printing, sounding an audible signal, marking up text, etc.
32 - 127         00100000 - 01111111     Standard part of the table.
128 - 255        10000000 - 11111111     Code page.

The second half of the ASCII code table, called the code page (128 codes, starting with 10000000 and ending with 11111111), can have different variants, each variant having its own number.

Please note that in the encoding table, letters (uppercase and lowercase) are arranged in alphabetical order and digits in ascending order. This observance of lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.


The most common encoding currently in use is the Microsoft Windows encoding, abbreviated CP1251.

Since the late 90s, the problem of standardizing character encoding has been addressed by a new international standard called Unicode. This is a 16-bit encoding, i.e. it allocates 2 bytes of memory per character. Of course, this doubles the amount of memory occupied, but such a code table allows up to 65536 characters. The complete specification of the Unicode standard includes all the existing, extinct, and artificially created alphabets of the world, as well as many mathematical, musical, chemical, and other symbols.
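The two-bytes-per-character behaviour described here corresponds to the UTF-16 form of Unicode. A quick Python illustration (added here; it uses the little-endian variant without a byte-order mark so the per-character bytes are plainly visible):

```python
# A 16-bit encoding allocates 2 bytes per character (for characters in
# the Basic Multilingual Plane).
data = "Hi".encode("utf-16-le")
assert data == b"H\x00i\x00"       # each letter becomes 2 bytes
assert len(data) == 2 * len("Hi")
print(data.hex())  # 48006900
```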

Let's try using the ASCII table to imagine how words will look in the computer's memory.

Word    Memory
file    01100110 01101001 01101100 01100101
disk    01100100 01101001 01110011 01101011
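The bit patterns above decode to the ASCII words "file" and "disk"; the same table can be regenerated with a few lines of Python (an added illustration; the helper name to_binary is invented here):

```python
# Render each character of a word as its 8-bit ASCII code.
def to_binary(word: str) -> list[str]:
    return [format(ord(ch), "08b") for ch in word]

assert to_binary("file") == ["01100110", "01101001", "01101100", "01100101"]
assert to_binary("disk") == ["01100100", "01101001", "01110011", "01101011"]
print(" ".join(to_binary("file")))
```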

When text information is entered into a computer, the characters (letters, digits, signs) are encoded using code systems consisting of code tables placed on the corresponding pages of text-encoding standards. In such tables, each character is assigned a specific numeric code in the hexadecimal or decimal number system; that is, code tables record the correspondence between character glyphs and numeric codes, and serve for encoding and decoding text information. When a character is entered from the computer keyboard it is converted into its numeric code; when text is output to an output device (display, printer, or plotter), the character's image is constructed from that numeric code. The assignment of a particular numeric code to a character is the result of agreements between the relevant organizations of different countries. At present there is no single universal code table covering the letters of the national alphabets of all countries.

Modern code tables include an international and a national part, i.e. they contain letters of the Latin and national alphabets, digits, arithmetic and punctuation signs, mathematical and control symbols, and pseudographic characters. The international part of the code table, based on the ASCII standard, encodes the first half of the characters with numeric codes from 0 to 7F₁₆, i.e. from 0 to 127 in the decimal number system. Codes from 0 to 20₁₆ (0 to 32₁₀) are assigned to the control and function keys (F1, F2, F3, etc.) of the personal computer keyboard. Fig. 3.1 shows the international part of code tables based on the ASCII standard, with the cells numbered in the decimal and hexadecimal number systems respectively.

Figure 3.1. International part of the code table (standard ASCII) with cell numbers presented in decimal (a) and hexadecimal (b) number systems


The national part of code tables contains codes of national alphabets, which is also called a table of character sets (charset).

Currently, several code tables (encodings) exist to support the letters of the Russian (Cyrillic) alphabet. They are used by different operating systems, which is a significant drawback and in some cases leads to problems when decoding numeric character values. Table 3.1 lists the code pages (standards) on which Cyrillic code tables (encodings) are located.

Table 3.1

One of the first standards for encoding the Cyrillic alphabet on computers was KOI8-R. The national part of the code table of this standard is shown in Fig. 3.2.

Fig. 3.2. National part of the code table of the KOI8-R standard


Also in current use for encoding the Cyrillic alphabet is the code table on page CP866 of the text-encoding standard, which is used in the MS DOS operating system or an MS DOS session (Fig. 3.3, a).

Fig. 3.3. National part of the code table located on page CP866 (a) and on page CP1251 (b) of the text information coding standard


The most widely used code table for encoding the Cyrillic alphabet at present is on page CP1251 of the corresponding standard, which is used in the Windows family of operating systems from Microsoft (Fig. 3.3, b). In all the code tables presented, except the Unicode standard, 8 binary digits (8 bits) are allocated to encode one character.

At the end of the last century a new international standard, Unicode, appeared, in which one character is represented by a two-byte binary code. The application of this standard continues the development of a universal international standard solving the problem of compatibility among national character encodings. Using it, 2¹⁶ = 65536 different characters can be encoded. Fig. 3.4 shows code table 0400 (the Russian alphabet) of the Unicode standard.

Fig. 3.4. Unicode code table 0400


Let us explain what has been said regarding the encoding of text information using an example.

Example 3.1

Encode the word “Computer” as a sequence of decimal and hexadecimal numbers using the CP1251 encoding. Determine which characters will be displayed when the resulting code is interpreted with the CP866 and KOI8-R code tables.

The sequences of hexadecimal and binary code for the word “Computer”, based on the CP1251 encoding table (see Fig. 3.3, b), will look like this:

This code sequence, interpreted in the CP866 and KOI8-R encodings, will result in the display of the following characters:
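Example 3.1 can be reproduced programmatically. The Python sketch below (an added illustration) encodes the Russian word for "Computer" in CP1251 and then views the same byte sequence through the CP866 and KOI8-R tables:

```python
# Encode the word in CP1251, then misread the byte sequence with the
# other two Cyrillic code tables (classic mojibake).
word = "Компьютер"
raw = word.encode("cp1251")
print(raw.hex())             # the CP1251 code sequence in hexadecimal
print(raw.decode("cp866"))   # what a CP866 viewer would display
print(raw.decode("koi8-r"))  # what a KOI8-R viewer would display
assert raw.decode("cp1251") == word  # the original table restores the word
```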

To convert Russian-language text documents from one encoding standard to another, special programs called converters are used; they are usually built into other programs. An example is the browser Internet Explorer (IE), which has a built-in converter. A browser is a special program for viewing the content of Web pages on the global Internet. Let's use this program to confirm the character-mapping result obtained in example 3.1 by performing the following steps.

1. Launch the Notepad program. In Windows XP, Notepad is started with the command: [Start button – Programs – Accessories – Notepad]. In the Notepad window that opens, type the word “Computer” using the syntax of the hypertext markup language HTML (HyperText Markup Language), the language used to create documents for the Internet. The text should look like this:

<h1>Computer</h1>

where <h1> and </h1> are tags (special constructs) of the HTML language for header markup. Fig. 3.5 shows the result of these actions.

Fig. 3.5. Displaying text in the Notepad window


Let's save this text with the command [File – Save As…] into an appropriate folder on the computer, giving the file the name Prim with the extension .html.

2. Launch Internet Explorer with the command: [Start button – Programs – Internet Explorer]. When the program starts, the window shown in Fig. 3.6 appears.

Fig. 3.6. Offline access window


Select and activate the Work Offline button; this prevents the computer from connecting to the global Internet. The main window of Microsoft Internet Explorer, shown in Fig. 3.7, will appear.

Fig. 3.7. Microsoft Internet Explorer main window


Execute the command [File – Open]; a window will appear (Fig. 3.8) in which you need to specify the file name and click OK, or click the Browse… button and find the file Prim.html.

Fig. 3.8. Open window


The main Internet Explorer window will take the form shown in Fig. 3.9, with the word “Computer” displayed. Next, using the program's top menu, execute the command [View – Encoding – Cyrillic (DOS)]; the characters shown in Fig. 3.10 will be displayed. Executing the command [View – Encoding – Cyrillic (KOI8-R)] displays the characters shown in Fig. 3.11.

Fig. 3.9. Characters displayed with the CP1251 encoding


Fig. 3.10. Characters displayed when the CP866 encoding is applied to a code sequence prepared in CP1251


Fig. 3.11. Characters displayed when the KOI8-R encoding is applied to a code sequence prepared in CP1251


Thus, the character sequences obtained using Internet Explorer coincide with those obtained from the CP866 and KOI8-R code tables in example 3.1.

3.2. Encoding graphic information

Graphic information presented as pictures, photographs, slides, moving images (animation, video), diagrams, and drawings can be created and edited on a computer, and it is encoded accordingly. There are now quite a large number of application programs for processing graphic information, but they all implement three types of computer graphics: raster, vector, and fractal.

If you look closely at a graphic image on a computer monitor, you can see a large number of multi-colored dots (pixels, from the English picture element) which, taken together, form the image. From this we can conclude that a graphic image on a computer is encoded in a certain way and must be represented as a graphic file. A file is the basic structural unit for organizing and storing data in a computer, and in this case it must contain the information needed to reproduce that set of dots on the monitor screen.

Files created on the basis of vector graphics contain information in the form of mathematical relationships (functions describing linear dependencies) and the corresponding data needed to construct the image of an object from line segments (vectors) when it is displayed on a computer monitor.

Files created based on raster graphics require storing data about each individual point in the image. To display raster graphics, complex mathematical calculations are not required; it is enough to simply obtain data about each point of the image (its coordinates and color) and display them on the computer monitor screen.

During the encoding process, an image is spatially discretized: the image is divided into individual points, and each point is assigned a color code (yellow, red, blue, etc.). To encode each point of a color graphic image, the principle of decomposing an arbitrary color into its main components is used; three primary colors are taken: red (Red, denoted by the letter R), green (Green, denoted by the letter G) and blue (Blue, denoted by the letter B). Any color of a dot perceived by the human eye can be obtained by additive (proportional) mixing of the three primary colors - red, green and blue. This coding system is called the RGB color system. Graphic image files that use the RGB color system represent each point of the image as a color triplet - three numerical values R, G and B, corresponding to the intensities of the red, green and blue components. The encoding of a graphic image is carried out using various technical means (a scanner, digital camera, digital video camera, etc.); the result is a raster image. When a color graphic image is reproduced on a color computer monitor, the color of each point (pixel) is obtained by mixing the three primary colors R, G and B.
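The idea of a pixel as an RGB triplet can be sketched in a few lines of Python (a minimal illustration; the helper name is hypothetical):

```python
# A raster pixel as a triple of 8-bit intensities, 0 (dark) to 255 (full).

def make_pixel(r, g, b):
    """Clamp each channel to the 0-255 range and return an (R, G, B) triple."""
    return tuple(max(0, min(255, c)) for c in (r, g, b))

red    = make_pixel(255, 0, 0)
white  = make_pixel(255, 255, 255)   # all three primaries at full intensity
yellow = make_pixel(255, 255, 0)     # red and green mix additively to yellow

print(red, white, yellow)
```

A raster image file is then, in essence, a grid of such triples, one per point of the image.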

The quality of a bitmap is determined by two main parameters: the resolution (the number of pixels horizontally and vertically) and the color palette used (the number of colors available for each pixel of the image). Resolution is specified by indicating the number of pixels horizontally and vertically, for example 800 by 600 pixels.

There is a relationship between the number of colors that can be assigned to a point of a raster image and the amount of information that must be allocated to store the color of the point, determined by the relation (R. Hartley's formula):

N = 2^I,     (3.1)

where I is the amount of information (in bits) and N is the number of colors that can be assigned to the point.

The amount of information required to store the color of a point is also called color depth, or color rendering quality.

So, if the number of colors specified for an image point is N = 256, then the amount of information required to store it (the color depth) in accordance with formula (3.1) will be I = 8 bits.
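The inverse of formula (3.1) is easy to compute in Python (a small sketch; the function name is chosen for illustration):

```python
import math

def color_depth(n_colors):
    """Color depth I in bits for a palette of N colors, from N = 2**I."""
    return int(math.log2(n_colors))

print(color_depth(256))     # 8 bits, the example from the text
print(color_depth(65536))   # 16 bits
```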

Computers use various graphics modes of monitor operation to display information. Note that in addition to the graphic mode of the monitor there is also a text mode, in which the monitor screen is conventionally divided into 25 lines of 80 characters each. Graphics modes are characterized by the screen resolution and the color quality (color depth). To set the graphic mode of the screen in the MS Windows XP operating system, execute the command [Start - Settings - Control Panel - Display]. In the "Properties: Screen" dialog box that appears (Fig. 3.12), select the "Settings" tab and use the "Screen resolution" slider to choose the appropriate resolution (800 by 600 pixels, 1024 by 768 pixels, etc.). Using the "Color quality" drop-down list, you can select the color depth: "Highest (32 bit)", "Medium (16 bit)", etc.; the number of colors available for each point of the image will then be 2^32 (4294967296), 2^16 (65536), and so on.

Fig. 3.12. The "Properties: Screen" dialog box


To implement each graphic mode of the monitor screen, a certain amount of computer video memory is required. The required information volume of video memory V is determined from the relation

V = K · I,     (3.2)

where K is the number of image points on the monitor screen (K = A · B); A is the number of dots horizontally; B is the number of dots vertically; I is the amount of information (color depth).

So, if the monitor screen has a resolution of 1024 by 768 pixels and a palette consisting of 65536 colors, then the color depth in accordance with formula (3.1) will be I = log2 65536 = 16 bits, the number of image pixels will be K = 1024 · 768 = 786432, and the required information volume of video memory in accordance with (3.2) will be

V= 786432 · 16 bits = 12582912 bits = 1572864 bytes = 1536 KB = 1.5 MB.
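The same calculation can be checked with a short Python sketch (the function name is chosen for illustration):

```python
def video_memory_bytes(width, height, color_depth_bits):
    """V = K * I: number of pixels times bits per pixel, converted to bytes."""
    return width * height * color_depth_bits // 8

v = video_memory_bytes(1024, 768, 16)
print(v, "bytes")               # 1572864 bytes
print(v / 1024 / 1024, "MB")    # 1.5 MB
```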

In conclusion, it should be noted that in addition to the listed characteristics, the most important characteristics of a monitor are the geometric dimensions of its screen and of an image point. The geometric dimensions of the screen are determined by the monitor's diagonal size, which is specified in inches (1 inch = 1" = 25.4 mm) and can take values of 14", 15", 17", 21", etc. Modern monitor production technologies can provide an image point size of 0.22 mm.

Thus, for each monitor there is a physically maximum possible screen resolution, determined by the size of its diagonal and the size of the image point.

Exercises to do on your own

1. Using MS Excel, convert the ASCII, CP866, CP1251 and KOI8-R code tables into tables of the following form: in the first column write, in alphabetical order, the uppercase and then lowercase letters of the Latin and Cyrillic alphabets; in the second column, the codes corresponding to the letters in the decimal number system; in the third column, the codes corresponding to the letters in the hexadecimal number system. Code values must be taken from the corresponding code tables.

2. Encode and write down the following words as a sequence of numbers in the decimal and hexadecimal number systems:

a) Internet Explorer; b) Microsoft Office; c) CorelDRAW.

Encoding is carried out using the modernized ASCII encoding table obtained in the previous exercise.

3. Using the modernized KOI8-R encoding table, decode sequences of numbers written in the hexadecimal number system:

a) FC CB DA C9 D3 D4 C5 CE C3 C9 D1;

b) EB CF CE C6 CF D2 CD C9 DA CD;

c) FC CB D3 D0 D2 C5 D3 C9 CF CE C9 DA CD.

4. How will the word "Cybernetics", written in the CP1251 encoding, look when the CP866 and KOI8-R encodings are used? Check the results using Internet Explorer.

5. Using the code table shown in Fig. 3.1, a, decode the following code sequences written in the binary number system:

a) 01010111 01101111 01110010 01100100;

b) 01000101 01111000 01100011 01100101 01101100;

c) 01000001 01100011 01100011 01100101 01110011 01110011.

6. Determine the information volume of the word "Economy" encoded using the CP866, CP1251, Unicode and KOI8-R code tables.

7. Determine the information volume of the file obtained as a result of scanning a color image measuring 12x12 cm. The resolution of the scanner used to scan this image is 600 dpi. The scanner sets the color depth of the image point to 16 bits.

A scanner resolution of 600 dpi (dots per inch) means that a scanner with this resolution can distinguish 600 dots along a 1-inch segment.

8. Determine the information volume of the file obtained as a result of scanning a color image of A4 size. The resolution of the scanner used to scan this image is 1200 dpi. The scanner sets the color depth of the image point to 24 bits.

9. Determine the number of colors in the palette at color depths of 8, 16, 24 and 32 bits.

10. Determine the required amount of video memory for monitor screen graphic modes of 640 by 480, 800 by 600, 1024 by 768 and 1280 by 1024 pixels with pixel color depths of 8, 16, 24 and 32 bits. Summarize the results in a table. Develop a program in MS Excel to automate the calculations.

11. Determine the maximum number of colors that can be used to store an image measuring 32 by 32 pixels, if the computer has 2 KB of memory allocated for the image.

12. Determine the maximum possible resolution of a monitor screen with a diagonal length of 15" and an image point size of 0.28 mm.

13. What graphic modes of the monitor can be provided by 64 MB of video memory?

Contents

I. History of information coding

II. Encoding information

III. Encoding text information

IV. Types of encoding tables

V. Calculation of the amount of text information

List of references

I. History of information coding

Humanity has been using text encryption (encoding) since the moment the first secret information appeared. Here are several text encoding techniques invented at various stages in the development of human thought:

cryptography - secret writing, a system of altering writing in order to make the text incomprehensible to the uninitiated;

Morse code, or the non-uniform telegraph code, in which each letter or sign is represented by its own combination of short bursts of electric current (dots) and bursts of triple duration (dashes);

sign language - a language of gestures used by people with hearing impairments.

One of the earliest known encryption methods is named after the Roman emperor Julius Caesar (1st century BC). This method is based on replacing each letter of the text with another letter a fixed number of positions further along the alphabet, with the alphabet read in a circle, that is, after the letter z comes a. Thus the word "byte", when shifted two characters to the right, is encoded as "davg". To decrypt the word, the reverse process is applied: each encrypted letter is replaced by the letter two positions to the left of it.
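The Caesar shift described above can be sketched in a few lines of Python (a minimal illustration for the Latin alphabet; note how the alphabet wraps around):

```python
def caesar(text, shift, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Shift each letter `shift` positions along the alphabet, wrapping around."""
    result = []
    for ch in text:
        if ch in alphabet:
            result.append(alphabet[(alphabet.index(ch) + shift) % len(alphabet)])
        else:
            result.append(ch)   # leave characters outside the alphabet untouched
    return "".join(result)

encrypted = caesar("byte", 2)      # b->d, y->a (wraps past z), t->v, e->g
decrypted = caesar(encrypted, -2)  # shifting back by 2 recovers the original
print(encrypted, decrypted)
```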

II. Encoding information

A code is a set of symbols (or signals) for recording (or conveying) certain predefined concepts.

Information coding is the process of forming a specific representation of information. In a narrower sense, the term “coding” is often understood as a transition from one form of information representation to another, more convenient for storage, transmission or processing.

Usually, each image when encoding (sometimes called encryption) is represented by a separate sign.

A sign is an element of a finite set of elements distinct from each other.


You can process text information on a computer. When entered into a computer, each letter is encoded with a certain number, and when output to external devices (screen or print), images of letters are constructed from these numbers for human perception. The correspondence between a set of letters and numbers is called a character encoding.

As a rule, all numbers in a computer are represented using zeros and ones (not ten digits, as is usual for people). In other words, computers usually operate in the binary number system, since this makes the devices for processing them much simpler. Entering numbers into a computer and outputting them for human reading can be done in the usual decimal form, and all necessary conversions are performed by programs running on the computer.

III. Encoding text information

The same information can be presented (encoded) in several forms. With the advent of computers, the need arose to encode all types of information that both an individual and humanity as a whole deal with. But humanity began to solve the problem of encoding information long before the advent of computers. The grandiose achievements of mankind - writing and arithmetic - are nothing more than a system for encoding speech and numerical information. Information never appears in its pure form, it is always presented somehow, encoded somehow.

Binary coding is one of the most common ways of representing information. In computers, robots and numerically controlled machine tools, all information that the device deals with is typically encoded as words of the binary alphabet.

Since the late 1960s, computers have increasingly been used for processing text information, and at present the bulk of the world's personal computers (and most of their time) is occupied with processing text. All these types of information are represented in a computer in binary code, that is, an alphabet of power two is used (only the two characters 0 and 1). This is because it is convenient to represent information as a sequence of electrical impulses: no impulse (0), impulse (1).

Such encoding is usually called binary, and the logical sequences of zeros and ones are called machine language.

From a computer's point of view, text consists of individual characters. Characters include not only letters (uppercase or lowercase, Latin or Russian), but also digits, punctuation marks, special characters such as "=", "(", "&" and so on, and even (pay special attention!) the spaces between words.

Texts are entered into the computer's memory from the keyboard. The familiar letters, digits, punctuation marks and other symbols are printed on the keys. They enter RAM in binary code: each character is represented by an 8-bit binary code.

Traditionally, an amount of information equal to 1 byte, i.e. I = 1 byte = 8 bits, is used to encode one character. Using the formula that connects the number of possible events K and the amount of information I, you can calculate how many different symbols can be encoded (assuming that symbols are possible events): K = 2^I = 2^8 = 256, i.e. an alphabet with a capacity of 256 characters can be used to represent text information.

This number of characters is quite sufficient to represent text information, including upper and lowercase letters of the Russian and Latin alphabet, numbers, signs, graphic symbols etc.

Coding consists of assigning each character a unique decimal code from 0 to 255, or the corresponding binary code from 00000000 to 11111111. Thus, a person distinguishes characters by their shape, and a computer by their code.

The convenience of byte-by-byte character encoding is obvious because a byte is the smallest addressable part of memory and, therefore, the processor can access each character separately when processing text. On the other hand, 256 characters is quite a sufficient number to represent a wide variety of symbolic information.

In the process of displaying a symbol on a computer screen, the reverse process is performed - decoding, that is, converting the symbol code into its image. It is important that assigning a specific code to a symbol is a matter of agreement, which is recorded in the code table.

Now the question arises, which eight-bit binary code to assign to each character. It is clear that this is a conditional matter; you can come up with many encoding methods.

All characters of the computer alphabet are numbered from 0 to 255. Each number corresponds to an eight-bit binary code from 00000000 to 11111111. This code is simply the serial number of the character in the binary number system.
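In Python this correspondence between a character, its serial number and the eight-bit binary code can be observed directly (a small illustration):

```python
# A character's code is its serial number in the encoding table,
# and its binary code is that number written in eight binary digits.
for ch in ("A", "a", "*"):
    code = ord(ch)                          # serial number of the character
    print(ch, code, format(code, "08b"))    # e.g. A 65 01000001
```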

IV. Types of encoding tables

A table in which all characters of the computer alphabet are assigned serial numbers is called an encoding table.

Different types of computers use different encoding tables.

The ASCII code table (American Standard Code for Information Interchange) has been adopted as an international standard. It encodes the first half of the characters with numeric codes from 0 to 127 (codes from 0 to 32 are assigned not to printable characters but to control functions).

The ASCII code table is divided into two parts.

Only the first half of the table is an international standard, i.e. characters with numbers from 0 (00000000) to 127 (01111111).

ASCII encoding table structure

Codes 0 - 31 (00000000 - 00011111). Symbols with numbers from 0 to 31 are usually called control symbols. Their function is to control the process of displaying text on the screen or printing it, sounding an audio signal, marking up text, and so on.

Codes 32 - 127 (00100000 - 01111111). The standard part of the table (English). It includes lowercase and uppercase letters of the Latin alphabet, decimal digits, punctuation marks, all kinds of brackets, commercial and other symbols. Character 32 is the space, i.e. an empty position in the text; all the others are represented by specific signs.

Codes 128 - 255 (10000000 - 11111111). The alternative part of the table (Russian).

The second half of the ASCII code table, called the code page (128 codes, from 10000000 to 11111111), can have different variants; each variant has its own number.

The code page is primarily used to accommodate national alphabets other than Latin. In Russian national encodings, characters from the Russian alphabet are placed in this part of the table.

First half of the ASCII code table

Please note that in the encoding table, letters (uppercase and lowercase) are arranged in alphabetical order, and numbers are ordered in ascending order. This observance of lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.

For letters of the Russian alphabet, the principle of sequential coding is also observed.

Second half of the ASCII code table

Unfortunately, there are currently five different Cyrillic encodings (KOI8-R, Windows, MS-DOS, Macintosh and ISO). Because of this, problems often arise when transferring Russian text from one computer to another, or from one software system to another.

Chronologically, one of the first standards for encoding Russian letters on computers was KOI8 ("Information Exchange Code, 8-bit"). This encoding was already in use in the 1970s on computers of the ES EVM series, and from the mid-1980s it began to be used in the first Russified versions of the UNIX operating system.

The CP866 encoding dates from the early 1990s, the time of dominance of the MS-DOS operating system ("CP" stands for "Code Page").

Apple computers running the Mac OS operating system use their own Mac encoding.

In addition, the International Standards Organization (ISO) has approved another encoding called ISO 8859-5 as a standard for the Russian language.

The most widespread encoding at present is the Microsoft Windows encoding, abbreviated CP1251. It was introduced by Microsoft and, given the wide distribution of this company's operating systems and other software products in the Russian Federation, it has become very widely used.

Since the late 90s, the problem of standardizing character encoding has been solved by the introduction of a new international standard called Unicode.

This is a 16-bit encoding, i.e. it allocates 2 bytes of memory for each character. This, of course, doubles the amount of memory occupied, but such a code table allows for up to 65536 characters. The complete specification of the Unicode standard includes all the existing, extinct and artificially created alphabets of the world, as well as many mathematical, musical, chemical and other symbols.

Internal representation of words in computer memory

using an ASCII table

Sometimes it happens that a text consisting of letters of the Russian alphabet received from another computer cannot be read - some kind of “abracadabra” is visible on the monitor screen. This happens because computers use different character encodings for the Russian language.

Thus, each encoding is specified by its own code table. As can be seen from the table, different characters are assigned to the same binary code in different encodings.

For example, the sequence of numeric codes 221, 194, 204 in the CP1251 encoding forms the word "ЭВМ" (Russian for "computer"), whereas in other encodings it will be a meaningless set of characters.
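This effect is easy to reproduce with Python's built-in codecs (a sketch; the byte values are the ones from the example above):

```python
# The same three bytes interpreted under different Cyrillic code pages.
data = bytes([221, 194, 204])

print(data.decode("cp1251"))   # the intended word
print(data.decode("cp866"))    # "krakozyabry" under the wrong code page
print(data.decode("koi8_r"))   # a different meaningless set again
```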

Fortunately, in most cases the user does not have to worry about transcoding text documents, since this is done by special converter programs built into applications.

V. Calculation of the amount of text information

Task 1: Encode the word “Rome” using the KOI8-R and CP1251 encoding tables.

Solution:
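The table lookups can be checked with Python's built-in codecs (a sketch of the solution; the Russian word "Рим" is assumed, since the task concerns Cyrillic code tables):

```python
word = "Рим"   # "Rome" in Russian

# Encode the word under each table and show the decimal and hex codes.
for encoding in ("koi8_r", "cp1251"):
    codes = word.encode(encoding)
    print(encoding, list(codes), [format(b, "02X") for b in codes])
```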

Task 2: Assuming that each character is encoded in one byte, estimate the information volume of the following sentence:

“My uncle has the most honest rules,

When I seriously fell ill,

He forced himself to respect

And I couldn’t think of anything better.”

Solution: This phrase has 108 characters, including punctuation, quotation marks and spaces. We multiply this number by 8 bits. We get 108*8=864 bits.

Task 3: The two texts contain the same number of characters. The first text is written in Russian, and the second in the language of the Naguri tribe, whose alphabet consists of 16 characters. Whose text contains more information?

Solution:

1) I = K · a (the information volume of a text equals the number of characters multiplied by the information weight of one character).

2) Since both texts have the same number of characters (K), the difference depends on the information weight of one character of the alphabet (a).

3) 2^a1 = 32, i.e. a1 = 5 bits; 2^a2 = 16, i.e. a2 = 4 bits.

4) I1 = K · 5 bits, I2 = K · 4 bits.

5) This means that the text written in Russian carries 5/4 times more information.

Task 4: The size of a message containing 2048 characters was 1/512 MB. Determine the power of the alphabet.

Solution:

1) I = 1/512 * 1024 * 1024 * 8 = 16384 bits - converted the information volume of the message into bits.

2) a = I / K = 16384 / 2048 = 8 bits per character of the alphabet.

3) N = 2^a = 2^8 = 256 characters - the power of the alphabet used.
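The arithmetic can be verified with a few lines of Python:

```python
size_bits = 1 / 512 * 1024 * 1024 * 8   # 1/512 MB expressed in bits
characters = 2048

bits_per_char = size_bits / characters  # information weight of one character
alphabet_power = 2 ** int(bits_per_char)

print(bits_per_char)     # 8.0 bits per character
print(alphabet_power)    # 256 symbols in the alphabet
```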

Task 5: A Canon LBP laser printer prints at an average speed of 6.3 Kbit/s. How long will it take to print an 8-page document, given that a page contains an average of 45 lines of 70 characters each (1 character = 1 byte)?

Solution:

1) Find the amount of information contained on 1 page: 45 * 70 * 8 bits = 25200 bits

2) Find the amount of information on 8 pages: 25200 * 8 = 201600 bits

3) We convert to common units of measurement, turning Kbits into bits: 6.3 · 1024 = 6451.2 bits/s.

4) Find the printing time: 201600 / 6451.2 ≈ 31 seconds.
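The same steps in Python, for checking:

```python
bits_per_page = 45 * 70 * 8      # 45 lines x 70 characters x 8 bits
total_bits = bits_per_page * 8   # an 8-page document
speed_bps = 6.3 * 1024           # 6.3 Kbit/s in bits per second

print(total_bits)                      # 201600 bits
print(round(total_bits / speed_bps))   # about 31 seconds
```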

Bibliography

1. Ageev V.M. Information and Coding Theory: Sampling and Coding of Measurement Information. Moscow: MAI, 1977.

2. Kuzmin I.V., Kedrus V.A. Fundamentals of Information Theory and Coding. Kyiv: Vishcha Shkola, 1986.

3. Zlatopolsky D.M. The Simplest Methods of Text Encryption. Moscow: Chistye Prudy, 2007. 32 p.

4. Ugrinovich N.D. Computer Science and Information Technology: Textbook for Grades 10-11. Moscow: BINOM. Laboratory of Knowledge, 2003. 512 p.

5. http://school497.spb.edu.ru/uchint002/les10/les.html#n

Material for self-study on the topic of Lecture 2

Encoding ASCII

The ASCII encoding table (ASCII - American Standard Code for Information Interchange).

In total, 256 different characters can be encoded using the ASCII table (Figure 1). The table is divided into two parts: the main part (codes 00h to 7Fh) and the additional part (codes 80h to FFh, where the letter h indicates that the code is in the hexadecimal number system).

Figure 1

To encode one character from the table, 8 bits (1 byte) are allocated. When processing text information, one byte may contain the code of a certain character - a letter, number, punctuation mark, action sign, etc. Each character has its own code in the form of an integer. In this case, all codes are collected in special tables called coding tables. With their help, the symbol code is converted into its visible representation on the monitor screen. As a result, any text in computer memory is represented as a sequence of bytes with character codes.

For example, the word hello! will be coded as follows (Table 1).

Table 1

Symbol    Decimal code    Binary code
h         104             01101000
e         101             01100101
l         108             01101100
l         108             01101100
o         111             01101111
!         33              00100001

Figure 1 shows the characters included in the standard (English) and extended (Russian) ASCII encoding.

The first half of the ASCII table is standardized. It contains the control codes (00h to 1Fh, and also 7Fh); these codes do not correspond to printable text elements. Punctuation marks and mathematical symbols are also placed here: 21h - !, 26h - &, 28h - (, 2Bh - +, ..., as well as uppercase and lowercase letters: 41h - A, 61h - a.

The second half of the table contains national fonts, pseudo-graphic symbols from which tables can be constructed, and special mathematical symbols. The lower part of the encoding table can be replaced using appropriate drivers (auxiliary control programs). This technique allows several fonts and typefaces to be used.

For each symbol code the display should show an image of the symbol: not the numeric code itself, but the corresponding picture, since each symbol has its own shape. A description of the shape of each character is stored in a special display memory, the character generator. A character on the screen of an IBM PC display, for example, is formed from the dots making up a character matrix. Each pixel in such a matrix is an image element and can be bright or dark. A dark dot is coded as 0, a light (bright) dot as 1. If the dark pixels in the matrix field of a sign are shown as dots and the light pixels as asterisks, the shape of the symbol can be depicted graphically.

People in different countries use symbols to write words in their native languages. These days, most applications, including e-mail systems and web browsers, are purely 8-bit, meaning that they can only display and correctly accept 8-bit characters in accordance with the ISO 8859-1 standard.

There are more than 256 characters in the world (if you take into account Cyrillic, Arabic, Chinese, Japanese, Korean and Thai), and more and more new characters are appearing. This creates the following problems for many users:

It is not possible to use characters from different encoding sets in the same document. Since each text document uses its own set of encodings, there are great difficulties with automatic text recognition.

New symbols appear (for example, the euro), as a result of which ISO has developed a new standard, ISO 8859-15, which is very similar to ISO 8859-1. The difference is that the new table drops symbols for old currencies no longer in use, to make room for the newly introduced symbols (such as the euro). As a result, users may have the same documents on their disks but in different encodings. The solution to these problems is the adoption of a single international set of encodings, called universal coding, or Unicode.

Encoding Unicode

The standard was proposed in 1991 by the non-profit Unicode Consortium (Unicode Inc.). The use of this standard makes it possible to encode a very large number of characters from different scripts: Unicode documents can contain Chinese characters, mathematical symbols, letters of the Greek alphabet, the Latin and Cyrillic alphabets, and switching code pages becomes unnecessary.

The standard consists of two main sections: the universal character set (UCS) and the encoding family (UTF, Unicode transformation format). The universal character set specifies a one-to-one correspondence between characters and codes - elements of the code space representing non-negative integers. An encoding family defines the machine representation of a sequence of UCS codes.

The Unicode standard was developed to create a single character encoding for all modern and many ancient written languages. Each character in this standard is encoded with 16 bits, which allows it to cover an incomparably larger number of characters than the previously accepted 8-bit encodings. Another important difference between Unicode and other encoding systems is that it not only assigns a unique code to each character but also defines various characteristics of that character, for example:

    character type (uppercase letter, lowercase letter, number, punctuation mark, etc.);

    character attributes (display from left to right or right to left, space, line break, etc.);

    the corresponding uppercase or lowercase letter (for lowercase and uppercase letters, respectively);

    the corresponding numeric value (for numeric characters).

The entire range of codes from 0 to FFFF is divided into several standard subsets, each of which corresponds either to the alphabet of a language or to a group of special characters that are similar in their functions. The diagram below contains a general list of Unicode 3.0 subsets (Figure 2).

Figure 2

The Unicode standard is the basis for storing text in many modern computer systems. However, it is not compatible with most Internet protocols because its codes can contain any byte values, and protocols typically use bytes 00 - 1F and FE - FF as service bytes. To achieve compatibility, several Unicode Transformation Formats (UTFs) have been developed, of which UTF-8 is by far the most common. This format defines the following rules for converting each Unicode code into a set of bytes (one to three) suitable for transport by Internet protocols.

Here x denotes the bits of the source code, which are taken starting from the least significant bit and written into the result bytes from right to left until all the indicated positions are filled. The conversion patterns are:

0000 - 007F: 0xxxxxxx
0080 - 07FF: 110xxxxx 10xxxxxx
0800 - FFFF: 1110xxxx 10xxxxxx 10xxxxxx

Further development of the Unicode standard is associated with the addition of new language planes, i.e. characters in the ranges 10000 - 1FFFF, 20000 - 2FFFF, etc., where it is planned to include encodings for the scripts of dead languages that did not fit into the table above. A new format, UTF-16, was developed to encode these additional characters.

So there are 4 main ways to encode Unicode bytes:

UTF-8: 128 characters are encoded in one byte (the ASCII range), 1920 characters in 2 bytes (Roman, Greek, Cyrillic, Coptic, Armenian, Hebrew and Arabic characters), and 63488 characters in 3 bytes (Chinese, Japanese, etc.). The remaining 2147418112 characters (not yet used) can be encoded with 4, 5 or 6 bytes.

UCS-2: each character is represented by 2 bytes. This encoding covers only the first 65535 characters of the Unicode format.

UTF-16: an extension of UCS-2, covering 1114112 Unicode characters. The first 65535 characters are represented by 2 bytes, the rest by 4 bytes.

UCS-4: each character is encoded in 4 bytes.
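The difference in bytes per character between these formats is easy to observe with Python's codecs (a small sketch):

```python
# Bytes spent per character by UTF-8 versus the fixed-width 2-byte form.
for ch in ("A", "Я", "€"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")    # "-le" avoids the byte-order mark
    print(ch, len(utf8), len(utf16))  # UTF-8 grows; UTF-16 stays at 2 here
```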

Hello, dear readers of this blog. Today we will talk about where krakozyabry (mojibake) come from on websites and in programs, what text encodings exist and which ones should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows-1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8.

To some, this information may seem unnecessary, but you would not believe how many questions I receive specifically about those creeping krakozyabry (unreadable sets of characters). Now I will be able to refer everyone to the text of this article and find my own mistakes. Well, get ready to absorb the information and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they have managed to undergo quite a lot of changes. Historically, it all started with EBCDIC (rather dissonant when pronounced in Russian), which made it possible to encode letters of the Latin alphabet, Arabic numerals and punctuation marks together with control characters.

But the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski" in Russian). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described in ASCII also included some service characters like brackets, hash marks, asterisks, and so on. In fact, you can see them yourself:

It is these 128 characters from the original version of ASCII that became the standard: in any other encoding you will definitely find them, and they will appear in this same order.

But the fact is that one byte of information can encode not 128 but as many as 256 different values (two to the power of eight equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, besides the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, that is, with zeros and ones ("Boolean algebra", for those who took it at an institute or school). A byte consists of eight bits, each of which represents a power of two, starting from zero and going up to two to the seventh:

It is not difficult to see that there can be only 256 such combinations of zeros and ones. Converting a number from binary to decimal is quite simple: you just add up all the powers of two that have ones above them.

In our example this comes to 1 (2 to the power of zero) plus 8 (2 to the power of 3), plus 32 (2 to the fifth power), plus 64 (2 to the sixth), plus 128 (2 to the seventh). The total is 233 in decimal notation. As you can see, everything is very simple.
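The same addition can be done programmatically. A small Python sketch (my own illustration, not from the article): walk the bits from the right and add up the powers of two that have a one above them.

```python
# Binary-to-decimal conversion by summing the powers of two under the 1-bits.
bits = "11101001"                     # the byte from the example
value = sum(2 ** power
            for power, bit in enumerate(reversed(bits))
            if bit == "1")
print(value)                          # same result as int(bits, 2)
```

Python's built-in `int(bits, 2)` does exactly this conversion in one call.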

But if you take a closer look at the table of ASCII characters, you will see that they are presented in hexadecimal notation. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. You probably know that the hexadecimal number system uses, besides the Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen).

Well then, to convert a binary number to hexadecimal there is a simple and obvious method: each byte of information is divided into two halves of four bits, as shown in the screenshot above. Each half-byte can encode only sixteen values (two to the fourth power), which can easily be written as a single hexadecimal digit.

Note that in the left half of the byte the powers are counted starting from zero again, not continued from the right half as shown in the screenshot. As a result, through simple calculations, we get that the screenshot encodes the number E9. I hope that the course of my reasoning and the solution of this puzzle were clear to you. Well, now let's continue, in fact, talking about text encodings.
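The nibble-splitting trick is easy to reproduce in code. A short Python sketch (my own, under the same example byte): shift out the high four bits, mask the low four, and print each half as one hex digit.

```python
# Split a byte into two 4-bit halves and write each as one hex digit.
byte = 0b11101001
high, low = byte >> 4, byte & 0x0F    # 1110 -> 14 (E), 1001 -> 9
hex_repr = f"{high:X}{low:X}"
print(hex_repr)                       # E9
```

The format specifier `{byte:02X}` produces the same two-digit result directly.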

Extended versions of ASCII - CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially it contained only 128 characters: the Latin alphabet, Arabic numerals and a few other things. In the extended versions it became possible to use all 256 values that can be encoded in one byte of information, i.e. it became possible to add letters of your own language to ASCII.

Here we need to digress again to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (representations) of the various characters, which live in font files, and a code that allows you to pull out of that set of vector shapes (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts themselves are responsible for the vector shapes, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (a text editor, a browser, etc.), while parsing the code, reads the encoding of the next character and looks up the corresponding vector shape in the font file connected to display this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding. That is why there are a whole bunch of such variants; for encoding Russian characters alone there are several varieties of extended ASCII.

For example, CP866 appeared first: an extended version of ASCII with the ability to use characters of the Russian alphabet.

That is, its upper half completely coincided with the basic version of ASCII (128 Latin characters, numbers and other stuff), which is presented in the screenshot just above, while the lower half of the CP866 table had the form shown in the screenshot just below and allowed another 128 characters to be encoded (Russian letters and all sorts of pseudographics):

You see, in the right column the numbers start with 8, because codes 0 through 7 in the high digit belong to the basic ASCII part (see the first screenshot). Thus, the Russian letter "М" in CP866 has the code 8C (it sits at the intersection of row 8 and column C of the hexadecimal table), which fits in one byte of information; given a suitable font with Russian characters, this letter will appear in the text without any problems.
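This one-byte code is easy to verify with Python's standard cp866 codec (a quick check of my own, not part of the original article):

```python
# The Russian letter "М" encodes to a single byte, 0x8C, in CP866.
code = "М".encode("cp866")
print(len(code), code.hex().upper())
```

The same call with any other CP866 character likewise yields exactly one byte, which is the whole point of extended ASCII encodings.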

Where did all this pseudographics in CP866 come from? The whole point is that this encoding for Russian text was developed back in those distant years when graphical operating systems were not as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the design of texts, and therefore CP866 and all its peers from the category of extended versions of ASCII abound in it.

CP866 was distributed by IBM, but besides it a number of other encodings were developed for Russian characters; KOI8-R, for example, belongs to the same type (extended ASCII):

The principle of its operation is the same as that of CP866 described a little earlier: each character of text is encoded by one single byte. The screenshot shows the second half of the KOI8-R table, because its first half fully matches basic ASCII, shown in the first screenshot in this article.

Among the features of the KOI8-R encoding, it can be noted that the Russian letters in its table are not in alphabetical order, as was done, for example, in CP866.

If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters are located in the same cells of the table as similar-sounding Latin letters from the first part of the table. This was done so that you could switch from Russian to Latin characters by discarding just one bit (two to the seventh power, i.e. 128).
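This bit-dropping trick can be demonstrated in a few lines of Python (my own sketch, using the standard koi8_r codec): clear the eighth bit of each letter's code and see which ASCII character falls out.

```python
# KOI8-R design: clearing bit 7 (value 128) of a Russian letter's code
# yields a similar-sounding Latin letter from the basic ASCII half.
word = "привет"
latin = "".join(chr(ch.encode("koi8_r")[0] & 0x7F) for ch in word)
print(word, "->", latin)   # a readable Latin transliteration appears
```

This is exactly why Russian text in KOI8-R, when the eighth bit was lost in transit, degraded into a still-readable Latin transliteration rather than complete garbage.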

Windows 1251 - the modern version of extended ASCII and why krakozyabry come out

The further development of text encodings was driven by graphical operating systems gaining popularity: the need for pseudographics in them gradually disappeared. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one character of text is encoded by one byte of information), but without the pseudographic symbols.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name "Cyrillic" was also used for the version with Russian language support. An example is Windows 1251.

It compared favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (except for the accent mark), as well as symbols used in related Slavic languages (Ukrainian, Belarusian, etc.):

Due to such an abundance of Russian-language encodings, font and software manufacturers constantly had headaches, and you and I, dear readers, often got those same notorious krakozyabry whenever the version used in a text got mixed up.

They appeared very often when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not fundamentally solve the problem, and users often avoided the Russian encodings like CP866, KOI8-R or Windows 1251 in their correspondence altogether, just to escape the notorious krakozyabry.

In fact, the krakozyabry appearing instead of Russian text were the result of using the wrong encoding for this language, one that did not match the encoding in which the text message had originally been encoded.

For example, if you try to display characters encoded with CP866 using the Windows 1251 code table, you will get exactly this gibberish (a meaningless set of characters) completely replacing the text of the message.
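This exact failure mode is trivial to reproduce in Python (an illustration of my own): write the bytes with one code page and misread them with the other.

```python
# Reproducing krakozyabry: bytes written in CP866 but mistakenly read
# as Windows 1251 turn into a meaningless set of characters.
original = "Привет"
garbled = original.encode("cp866").decode("cp1251")
print(garbled)
# Reading the same bytes back with the correct code page recovers the text.
restored = garbled.encode("cp1251").decode("cp866")
print(restored)
```

Note that the bytes themselves are never damaged; only the interpretation is wrong, which is why picking the right encoding restores the text completely.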

A similar situation very often arises on forums or blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in the wrong text editor, which adds bytes to the code that are not visible to the naked eye.

In the end, many people got tired of this situation with a multitude of encodings and constantly creeping krakozyabry, and the prerequisites appeared for the creation of a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - universal encodings UTF 8, 16 and 32

These thousands of characters of the Southeast Asian language group could not possibly be described in the one byte of information allocated for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created, with the collaboration of many IT industry leaders (those who produce software, those who make hardware, those who create fonts), who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character: 32 bits equal 4 bytes of information needed to encode one single character in the new universal UTF encoding.

As a result, the same text file encoded in extended ASCII and in UTF-32 will, in the latter case, weigh four times more. That is bad, but now we can encode a number of characters equal to two to the thirty-second power (billions of characters, which covers any realistically required value with a colossal reserve).

But for many countries with languages of the European group, such a huge number of characters was not needed at all; yet with UTF-32 they would get, for nothing, a fourfold increase in the weight of text documents, and as a result an increase in Internet traffic and in the volume of stored data. That is a lot, and no one could afford such waste.

As Unicode developed further, UTF-16 appeared. It turned out so successful that it was adopted by default as the base space for all the characters we use. It encodes one character in two bytes. Let's see how this thing looks.

In the Windows operating system, you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table will open with the vector shapes of all the fonts installed on your system. If in the "Advanced options" you select the Unicode character set, you can see, for each font separately, the entire range of characters it includes.

By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and it is this number that was adopted as the base space in Unicode. In addition, there are ways to encode characters beyond it using surrogate pairs, extending the space to a total of 1,114,112 code points.
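The two-bytes-inside, four-bytes-outside rule is easy to observe from code. A short Python sketch of my own (the little-endian codec is used so no BOM is added):

```python
# Two bytes suffice for any character of the 65,536-character base space
# (the BMP); anything beyond it needs a surrogate pair, i.e. four bytes.
bmp_char = "Я"               # U+042F, inside the base space
astral_char = "\U0001D11E"   # U+1D11E, musical G clef, outside it
print(len(bmp_char.encode("utf-16-le")),
      len(astral_char.encode("utf-16-le")))
```

So documents that stick to the base space pay a flat two bytes per character, and only rare characters cost four.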

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, for example, programs only in English, because after the transition from extended ASCII to UTF-16 the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely to satisfy everyone that the Unicode consortium decided to come up with a variable length encoding. It was called UTF-8. Despite the eight in the name, it really does have variable length: each character of text can be encoded in a sequence from one up to several bytes.

In practice UTF-8 uses only the range of one to four bytes, because the Unicode code space is capped at U+10FFFF and nothing beyond four bytes is needed (the original design also described five- and six-byte sequences, but they were later prohibited). All Latin characters are encoded in one byte, just as in the good old ASCII.

What is noteworthy is that if only the Latin alphabet is encoded, even programs that do not understand Unicode will still read what is encoded in UTF-8. That is, the basic part of ASCII simply carried over into this creation of the Unicode consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and Georgian characters, for example, in three. By creating UTF-16 and UTF-8, the Unicode consortium solved the main problem: now fonts have a single code space. Their manufacturers can only fill it with vector shapes of text characters according to their strengths and capabilities.
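Both claims, the ASCII compatibility and the per-alphabet byte counts, can be checked directly (a Python sketch of my own, using the standard codecs):

```python
# UTF-8 is byte-compatible with ASCII, while other alphabets cost more bytes.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
print(ascii_bytes == utf8_bytes)     # identical byte sequences
print(len("м".encode("utf-8")))      # Cyrillic letter: 2 bytes
print(len("ქ".encode("utf-8")))      # Georgian letter: 3 bytes
```

This backward compatibility is the main reason UTF-8 eventually displaced the one-byte national code pages on the web.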

In the "Character Map" mentioned above you can see that different fonts support different numbers of characters; some Unicode-rich fonts can be quite heavy. But now they differ not by having been created for different encodings, but by how completely the font manufacturer has filled the single code space with particular vector shapes.

Krakozyabry instead of Russian letters - how to fix it

Let's now see how krakozyabry appear instead of text or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit this very text, or code using text fragments.

For editing and creating text files I personally use what is, in my opinion, a very good editor: Notepad++. It can highlight the syntax of hundreds of programming and markup languages, and can also be extended with plugins. Read a detailed review of this wonderful program at the link provided.

In the top menu of Notepad++ there is an item "Encodings", where you can convert an existing option to the one used by default on your site:

In the case of a site on Joomla 1.5 and higher, as well as in the case of a blog on WordPress, to avoid the appearance of krakozyabry you should choose the option UTF-8 without BOM. And what is this BOM prefix?

The fact is that when the UTF-16 encoding was being developed, it was decided for some reason to allow writing a character's code both in direct byte order (for example, 0A15) and in reverse (150A). So that programs could understand in which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented: a couple of extra bytes added to the very beginning of a document (in UTF-8 the signature takes three bytes).

In the UTF-8 encoding no BOM was provided for by the Unicode consortium, so adding a signature (those notorious extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we should always choose the option without BOM (without signature). This way you protect yourself in advance from creeping krakozyabry.
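You can see those extra signature bytes with Python's "utf-8-sig" codec, which deliberately writes a BOM (a sketch of my own):

```python
# The UTF-8 signature (BOM) is just three fixed bytes, EF BB BF,
# prepended to the otherwise unchanged document bytes.
with_bom = "Привет".encode("utf-8-sig")
without_bom = "Привет".encode("utf-8")
print(with_bom[:3].hex().upper())    # the signature itself
print(with_bom[3:] == without_bom)   # the rest is identical
```

A program that does not expect the signature will treat those three bytes as text, which is exactly how the krakozyabry at the start of a page come about.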

What is noteworthy is that some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM), for example the same notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) to the beginning of it, and these bytes are always the same. On servers, this little thing can cause a problem: krakozyabry come out.

Therefore, never use the regular Windows Notepad to edit documents on your site if you don't want krakozyabry to appear. The best and simplest option, I think, is the already mentioned Notepad++ editor, which has practically no disadvantages and consists only of advantages.

In Notepad++, when choosing an encoding, you will also have the option to convert text to UCS-2, which is very close in nature to the Unicode standard. Notepad++ can also encode text in ANSI, i.e., for the Russian language, Windows 1251, which we described just above. Where does this information come from?

It is registered in the registry of your Windows operating system: which code page to choose in the case of ANSI and which in the case of OEM (for the Russian language it is CP866). If you set a different default language on your computer, these encodings are replaced with the corresponding ANSI and OEM code pages of that language.

After you save the document in Notepad++ in the encoding you need or open the document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid krakozyabry, in addition to the actions described above, it is useful to write information about the encoding into the header of the source code of all pages of the site, so that there is no confusion on the server or the local host.

In general, all XML-based markup languages (that is, everything except Html itself) use a special xml declaration, which specifies the text encoding.

Before parsing the code, the browser then knows which version is being used and exactly how to interpret the character codes of that language. What is noteworthy, though, is that if you save the document in default Unicode, this xml declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one).

In the case of an HTML document, the encoding is indicated by the Meta element, which is written between the opening and closing Head tags:

<head> ... <meta charset="utf-8"> ... </head>

This entry differs quite a bit from the older form, but it fully complies with the gradually introduced HTML 5 standard, and it will be correctly understood by any browser currently in use.

In theory, the Meta element indicating the encoding of the HTML document is better placed as high as possible in the document header, so that by the time the first character outside basic ASCII is encountered (basic ASCII characters are always read correctly, in any variation), the browser already has the information on how to interpret their codes.

Good luck to you! See you soon on the pages of the blog site
