Full ASCII table. Encoding text information


Encoding is the process of converting information into a form that allows for more convenient transmission, storage, or automatic processing of data; it is in this form that a computer understands it. Various tables are used for this purpose. ASCII was the first such system, developed in the United States for working with English text, and it subsequently became widespread throughout the world. The article below is devoted to its description, features, properties, and further use.

Display and storage of information in a computer

Characters on a computer monitor or on the screen of a mobile digital gadget are formed from two things: a set of vector forms of various symbols, and a code that makes it possible to find among them the symbol that needs to be inserted in the right place. That code is a sequence of bits. Thus, each character must uniquely correspond to a set of zeros and ones that appear in a certain, unique order.

How it all began

Historically, the first computers were English-language. To encode character information in them, it was enough to use only 7 bits of memory, although a whole byte of 8 bits was allocated for the purpose. The number of characters the computer understood in this case was 128. These included the English alphabet, punctuation marks, digits, and some special characters. The English-language seven-bit encoding with its corresponding table (code page), developed in 1963, was named the American Standard Code for Information Interchange. The abbreviation "ASCII" was, and still is, commonly used to denote it.

Transition to multilingualism

Over time, computers became widely used in non-English-speaking countries, and a need arose for encodings that could represent national languages. It was decided not to reinvent the wheel and to take ASCII as a basis. The encoding table in the new edition expanded significantly: using the 8th bit made it possible to encode 256 characters.

Description

The ASCII table is divided into 2 parts; only its first half is considered a generally accepted international standard. The full table includes:

  • Characters with serial numbers from 0 to 31, encoded by the sequences 00000000 through 00011111. They are reserved for control characters that manage the process of displaying text on a screen or printer, sound an audible signal, and so on.
  • Characters with numbers from 32 to 127, encoded by the sequences 00100000 through 01111111, which form the standard part of the table. These include the space (N 32), the letters of the Latin alphabet (lowercase and uppercase), the ten digits 0 through 9, punctuation marks, brackets of various styles, and other symbols.
  • Characters with serial numbers from 128 to 255, encoded by the sequences 10000000 through 11111111. These include the letters of national alphabets other than Latin. It is this alternative part of the ASCII table that is used to convert Russian characters into computer form. (A small sketch below illustrates these ranges.)
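For readers who like to experiment, here is a minimal Python sketch (the language choice and the sample codes are mine, not part of the standard) that classifies a code into the three ranges just described:

    # Classify ASCII codes into the three ranges described above.
    for code in (7, 65, 200):
        if code < 32:
            kind = "control character"
        elif code < 128:
            kind = "standard printable character: " + chr(code)
        else:
            kind = "national (code page) character; meaning depends on the code page"
        print(f"{code:3d} = {code:08b} -> {kind}")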

Some properties

Among the features of the ASCII encoding is that the lowercase and uppercase letters "A" through "Z" differ by only one bit. This circumstance greatly simplifies case conversion, as well as checking whether a code belongs to a given range of values. In addition, all letters in the ASCII system are represented by their own sequence numbers in the alphabet, written as 5 binary digits preceded by 011₂ for lowercase letters and 010₂ for uppercase letters.
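Here is a small Python illustration of that property (my own sketch, not something defined by the standard):

    # Uppercase and lowercase Latin letters differ only in one bit (0x20 = 00100000).
    print(bin(ord("A")), bin(ord("a")))  # 0b1000001 0b1100001
    print(chr(ord("a") ^ 0x20))          # A: flipping that single bit changes the case
    print(chr(ord("A") ^ 0x20))          # a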

Another feature of the ASCII encoding is the representation of the ten digits "0" through "9". Their codes start with the nibble 0011₂ and end with the digit's value in binary. Thus, 0101₂ is equivalent to the decimal number five, so the symbol "5" is written as 0011 0101₂. Based on this, you can easily convert binary-coded decimal numbers to an ASCII string by prepending the bit sequence 0011₂ to each nibble.
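A sketch of that nibble trick in Python (bcd_to_ascii is a hypothetical helper written for this article, not a library function):

    # Prepend 0011 to each decimal nibble to get the ASCII digit codes.
    def bcd_to_ascii(nibbles):
        return "".join(chr(0b0011_0000 | n) for n in nibbles)

    print(bcd_to_ascii([5, 0, 7]))  # '507'; e.g. 0101 becomes 0011 0101 = '5'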

"Unicode"

As is well known, the languages of the Southeast Asian group require thousands of characters to display their texts. Such a quantity cannot be described in one byte of information, so even extended versions of ASCII could no longer satisfy the growing needs of users from different countries.

Thus, the need arose to create a universal text encoding, a task undertaken, in collaboration with many leaders of the global IT industry, by the Unicode consortium. Its specialists created the UTF-32 system, in which 32 bits, that is, 4 bytes of information, were allocated to encode 1 character. The main disadvantage was a fourfold jump in the amount of required memory, which entailed many problems.

At the same time, for most countries with official languages belonging to the Indo-European group, a number of characters equal to 2³² is more than excessive.

As a result of further work by the Unicode consortium's specialists, the UTF-16 encoding appeared. It became the option for converting character information that suited everyone, both in the amount of memory required and in the number of encoded characters. That is why UTF-16 was adopted as the default; it reserves 2 bytes for one character.

Even this fairly advanced and successful version of Unicode had a drawback: after the transition from the extended version of ASCII to UTF-16, the size of a document doubled.

For this reason, it was decided to use the variable-length encoding UTF-8, in which each character of the source text is encoded as a sequence of 1 to 6 bytes.

Relation to the American Standard Code for Information Interchange

In variable-length UTF-8, all Latin characters are encoded in 1 byte, just as in the ASCII encoding system.

A special feature of UTF-8 is that, for a Latin text without other characters, even programs that do not understand Unicode can still read it. In other words, the basic part of the ASCII encoding simply became part of the new variable-length UTF. Cyrillic characters in UTF-8 occupy 2 bytes and, for example, Georgian characters 3 bytes. By creating UTF-16 and UTF-8, the main problem, a single code space in fonts, was solved. Since then, font manufacturers need only fill the table with vector forms of text characters according to their needs.
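These byte counts are easy to verify, for example in Python (the sample characters are my own choice):

    # UTF-8 is variable length: different scripts need different byte counts.
    for ch in ("A", "ж", "ქ"):  # Latin, Cyrillic, Georgian
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # A 1 byte(s) / ж 2 byte(s) / ქ 3 byte(s)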

Depending on the operating system, different encodings are preferred. To read and edit texts typed in a different encoding, Russian text conversion programs are used. Some text editors contain built-in transcoders and can read text regardless of its encoding.

Now you know how many characters there are in the ASCII encoding and how and why it was developed. Of course, the Unicode standard is the most widespread in the world today. However, we must not forget that it is based on ASCII, so the contribution of its developers to the IT field should be appreciated.

To use ASCII correctly, it helps to expand your knowledge of this area and of encoding capabilities in general.

What is it?

ASCII is an encoding table of the printable characters typed on a computer keyboard, plus some control codes, used to transmit information. In other words, the alphabet and decimal digits are encoded into corresponding symbols that represent and carry the necessary information.

ASCII was developed in America, so the standard character set includes the English alphabet and digits, 128 characters in total. But then a fair question arises: what should be done if the national alphabet needs to be encoded?

Other versions of the ASCII table were developed to address this issue. For example, for languages with a different structure, the letters of the English alphabet were either removed or supplemented with characters of the national alphabet. Thus, an ASCII encoding may even contain Russian letters for national use.

Where is the ASCII coding system used?

This encoding system is needed not only for typing text information on the keyboard. It is also used in graphics: for example, in the ASCII Art Maker program, graphic images of various formats are composed of a range of ASCII characters.


Such programs can usually be divided into those that act as graphic editors, converting an image into text, and those that convert an image into ASCII graphics. The well-known emoticon (a "smiling human face", as it is also called) is another example of an encoding character.

This encoding method can also be used when writing or creating an HTML document: you enter a specific set of characters, and when the page is viewed, the symbol corresponding to this code is displayed on the screen.
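For instance, the numeric reference &#65; stands for the Latin letter "A" (code 65). A small sketch using Python's standard html module shows the same decoding a browser performs (the sample references are mine):

    import html

    # A numeric character reference carries the character's code;
    # html.unescape turns it back into the symbol, as a browser would.
    print(html.unescape("&#65;&#66;&#67;"))  # ABC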

Among other things, this type of encoding is necessary when creating a multilingual website, because characters that are not included in a particular national table have to be replaced with ASCII codes. If the reader works directly with information and communication technologies (ICT), it will be useful to get acquainted with such systems as:

  1. Portable character set;
  2. Control characters;
  3. EBCDIC;
  4. VISCII;
  5. YUSCII;
  6. Unicode;
  7. ASCII art;
  8. KOI-8.

ASCII Table Properties

Like any systematic scheme, ASCII has its own characteristic properties. For example, numbers in the decimal system (digits 0 through 9) are converted to the binary number system (for example, 72 in decimal is 1001000 in binary).

The letters located in the uppercase and lowercase columns of the table differ from each other by only one bit, which significantly reduces the complexity of checking and converting case.

With all these properties, ASCII works as an eight-bit encoding, although it was originally designed as seven-bit.

Application of ASCII in Microsoft Office programs

If necessary, this kind of information encoding can be used in Microsoft Notepad and Microsoft Office Word. Within these applications, a document can be saved in ASCII format, but in that case some functions will be unavailable when typing text.

In particular, bold and semi-bold formatting will be unavailable, because the encoding preserves only the content of the typed information, not its general appearance and form. You can add ASCII codes to a document using the following software applications:

  • Microsoft Excel;
  • Microsoft FrontPage;
  • Microsoft InfoPath;
  • Microsoft OneNote;
  • Microsoft Outlook;
  • Microsoft PowerPoint;
  • Microsoft Project.

Keep in mind that when typing an ASCII code in these applications, you must hold down the ALT key.

Of course, all the necessary codes require a longer and more detailed study, but this is beyond the scope of our article today. I hope that you found it really useful.

See you again!


According to the International Telecommunication Union, in 2016, three and a half billion people used the Internet with some regularity. Most of them don't even think about the fact that any messages they send via PC or mobile gadgets, as well as texts that are displayed on all kinds of monitors, are actually combinations of 0 and 1. This representation of information is called encoding. It ensures and greatly facilitates its storage, processing and transmission. In 1963, the American ASCII encoding was developed, which is the subject of this article.

Presenting information on a computer

From the point of view of any electronic computer, text is a set of individual characters. These include not only letters (both uppercase and lowercase) but also digits and punctuation marks. In addition, special characters such as "=", "&", "(" and the space are used.

The set of characters that make up a text is called the alphabet, and their number is its cardinality (denoted N). It is determined by the expression N = 2^b, where b is the number of bits, that is, the information weight of a particular symbol.
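As a trivial check of this formula in Python (my own illustration):

    # N = 2^b: an information weight of 8 bits gives a 256-character alphabet.
    b = 8
    print(2 ** b)  # 256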

It has been proven that an alphabet with a capacity of 256 characters can represent all the necessary characters.

Since 256 is two to the eighth power, the weight of each character is 8 bits.

A unit of measurement of 8 bits is called 1 byte, so it is customary to say that any character in text stored on a computer takes up one byte of memory.

How is coding done?

Texts are entered into the memory of a personal computer via keyboard keys, on which digits, letters, punctuation marks, and other symbols are printed. Into RAM they are transferred in binary code: each character is associated with a decimal code familiar to humans, from 0 to 255, which corresponds to a binary code from 00000000 to 11111111.

Byte-by-byte character encoding allows the processor performing text processing to access each character individually. At the same time, 256 characters are quite enough to represent any character information.
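What "access each character individually" means is easy to show in Python with a single-byte encoding (the choice of Windows-1251 and the sample word are mine):

    # In a single-byte encoding the n-th byte IS the n-th character,
    # so code can index characters directly.
    data = "Привет".encode("cp1251")
    print(len(data))     # 6: one byte per character
    print(hex(data[0]))  # 0xcf: the Windows-1251 code of 'П'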

ASCII character encoding

This English abbreviation stands for American Standard Code for Information Interchange.

Even at the dawn of computerization, it became obvious that it was possible to come up with a wide variety of ways to encode information. However, to transfer information from one computer to another, it was necessary to develop a unified standard. So, in 1963, the ASCII encoding table appeared in the USA. In it, any symbol of the computer alphabet is associated with its serial number in binary representation. ASCII was originally used only in the United States and later became an international standard for PCs.

ASCII codes are divided into 2 parts. Only the first half of this table is considered the international standard. It includes characters with serial numbers from 0 (coded as 00000000) to 127 (coded 01111111).

The standard groups the codes as follows (serial number N, binary code, and the symbols they denote):

  • N 0-31 (0000 0000 - 0001 1111): control characters. Their function is to "manage" the process of displaying text on a monitor or printing device, to sound an audio signal, and so on.
  • N 32-127 (0010 0000 - 0111 1111): the standard part of the table. It contains the uppercase and lowercase letters of the Latin alphabet, the ten digits, punctuation marks, as well as various brackets, commercial and other symbols. Character 32 is the space.
  • N 128-255 (1000 0000 - 1111 1111): the alternative part of the table, or code page, which can come in different variants, each with its own number. The code page is used to specify national alphabets different from the Latin one. In particular, it is with its help that ASCII encoding of Russian characters is carried out.

In the table, uppercase and lowercase letters follow one another in alphabetical order, and digits in ascending order. The same principle holds for the Russian alphabet.

Control characters

The ASCII encoding table was originally created for receiving and transmitting information via the teletype, a device that has long since fallen out of use. In this regard, the character set included non-printable characters used as commands to control this device. Similar commands existed in such pre-computer messaging methods as Morse code, etc.

The most common teletype character is NUL (00). It is still used today in many programming languages to indicate the end of a string.
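A quick Python sketch via the standard ctypes module shows the C-style terminator in action (an illustration, not part of ASCII itself):

    import ctypes

    # C-style strings are NUL-terminated: create_string_buffer appends the 0x00 byte.
    buf = ctypes.create_string_buffer(b"hi")
    print(buf.raw)  # b'hi\x00': the trailing NUL marks the end of the string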

Where is ASCII encoding used?

The American standard code is needed not only for entering text information from the keyboard. It is also used in graphics: in ASCII Art Maker, for instance, images of various formats are composed of a spectrum of ASCII characters.

There are two types of such products: those that act as graphic editors by converting images into text, and those that convert "drawings" into ASCII graphics. The famous emoticon is a prime example of an encoding symbol.

ASCII can also be used when creating an HTML document. In this case, you can enter a certain set of characters, and when viewing the page, a symbol that corresponds to this code will appear on the screen.

ASCII is also necessary for creating multilingual websites, since characters that are not included in a specific national table are replaced with ASCII codes.

Some features

ASCII was originally designed to encode text information using 7 bits (the eighth was left unused), but today it works as an 8-bit encoding.

The letters in the uppercase and lowercase columns of the table differ from each other by only a single bit, which significantly reduces the complexity of case checks.

Using ASCII in Microsoft Office

If necessary, this type of text information encoding can be used in Microsoft text editors such as Notepad and Office Word. However, you may not be able to use some functions when typing in this case. For example, you won't be able to use bold text because ASCII encoding only preserves the meaning of the information, ignoring its general appearance and form.

Standardization

The ISO organization adopted the ISO 8859 family of standards. This group defines eight-bit encodings for different language groups. Specifically, ISO 8859-1 is extended ASCII, a table for the United States and the countries of Western Europe. ISO 8859-5 is a table used for the Cyrillic alphabet, including Russian.

For a number of historical reasons, the ISO 8859-5 standard was used for a very short time.

For the Russian language, the encodings actually in use at the moment are:

  • CP866 (Code Page 866), or DOS, often called the alternative GOST encoding. It was actively used until the mid-90s of the last century; nowadays it is practically not used.
  • KOI-8. This encoding was developed in the 1970s and 80s and is now the generally accepted standard for mail messages in the Russian segment of the Internet. It is widely used in Unix-family operating systems, including Linux. The "Russian" version of KOI-8 is called KOI8-R. In addition, there are versions for other Cyrillic languages, for example Ukrainian.
  • Code Page 1251 (CP 1251, Windows-1251). Developed by Microsoft to provide Russian-language support in the Windows environment. (The sketch after this list shows how the three differ byte-wise.)
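Here is that sketch, in Python (the sample letter is my choice): the same Russian letter receives a different byte in each of the three encodings, which is the root of the compatibility pain.

    # One letter, three single-byte encodings, three different codes.
    for enc in ("cp866", "koi8_r", "cp1251"):
        print(enc, hex("Ж".encode(enc)[0]))
    # cp866 0x86, koi8_r 0xf6, cp1251 0xc6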

The main advantage of the earlier CP866 standard was the preservation of pseudographic characters in the same positions as in extended ASCII. This allowed foreign-made text programs, such as the famous Norton Commander, to run without changes. CP866 is still used for programs developed for Windows that run in full-screen text mode or in text windows, including FAR Manager.

Computer texts written in the CP866 encoding have become quite rare lately, but it is this encoding that is used for Russian file names in Windows.

"Unicode"

At the moment, this encoding is the most widely used. Unicode codes are divided into areas. The first (U+0000 to U+007F) contains the ASCII characters with their codes. Then come the character areas of various national scripts, as well as punctuation marks and technical symbols. In addition, some Unicode codes are held in reserve in case new characters need to be included in the future.
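Since Python strings are Unicode natively, these code points can be inspected directly (a simple illustration of my own):

    # The first Unicode area (U+0000..U+007F) coincides with ASCII.
    print(hex(ord("A")))  # 0x41: the same value as in ASCII
    print(hex(ord("Ж")))  # 0x416: the Cyrillic block, beyond the ASCII area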

Now you know that in ASCII, each character is represented as a combination of 8 zeros and ones. To non-specialists, this information may seem unnecessary and uninteresting, but don’t you want to know what’s going on “in the brains” of your PC?!

Hello, dear readers of this blog. Today we will talk about where krakozyabry (garbled characters) come from on websites and in programs, about what text encodings exist, and about which ones should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern Unicode consortium encodings UTF-16 and UTF-8.

To some, this information may seem unnecessary, but if only you knew how many questions I receive specifically about creeping krakozyabry (an unreadable set of characters). Now I will be able to refer everyone to the text of this article. Well, get ready to absorb the information and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they managed to undergo quite a few changes. Historically, it all started with EBCDIC (rather dissonant when pronounced in Russian), which made it possible to encode the letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

Still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski" in Russian). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described in ASCII also include service characters like brackets, hash marks, asterisks, and so on; you can see all of them in any ASCII table.

It is these 128 characters from the original version of ASCII that became the standard; in any other encoding you will definitely find them, and they will appear in this order.

But the fact is that with one byte of information you can encode not 128 but as many as 256 different values (two to the power of eight equals 256), so after the basic version a whole series of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode the characters of a national encoding (for example, Russian).

Here it is probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone took it at an institute or school). Secondly, a byte consists of eight bits, each of which corresponds to a power of two, starting from two to the zeroth power and up to two to the seventh.

It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a scheme. Converting a number from binary to decimal is quite simple: you just add up all the powers of two that have ones in them.

In our example of the byte 11101001, this turns out to be 1 (two to the zeroth power) plus 8 (two to the third), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh): the total is 233 in the decimal system. As you can see, everything is very simple.
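You can double-check such arithmetic in Python (the number comes from the example above):

    # Binary to decimal: 1 + 8 + 32 + 64 + 128 = 233.
    print(int("11101001", 2))  # 233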

But if you look closely at a table of ASCII characters, you will see that they are represented in hexadecimal. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. You probably know that in the hexadecimal number system, in addition to Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen) are used.

Converting a binary number to hexadecimal is done by the following simple and obvious method: each byte of information is divided into two halves of four bits. Each half-byte of binary code can encode only sixteen values (two to the fourth power), which can easily be represented as one hexadecimal digit.

Note that within each half of the byte the powers are counted starting from zero again. As a result, through simple calculations, we get that our example number 11101001 is written as E9 in hexadecimal. I hope the course of my reasoning and the solution to this puzzle were clear to you. Well, now let's continue talking about text encodings.
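The same nibble trick, sketched in Python (same example byte as above):

    # Each half-byte (nibble) maps to one hexadecimal digit: 1110 -> E, 1001 -> 9.
    value = 0b11101001
    print(hex(value >> 4), hex(value & 0x0F))  # 0xe 0x9
    print(hex(value))                          # 0xe9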

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially, it contained only 128 characters: the Latin alphabet, Arabic numerals and so on. But in the extended version it became possible to use all 256 values that can be encoded in one byte of information. That is, it became possible to add the letters of one's own language to ASCII.

Here we need to digress again to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector forms (representations) of various characters (located in font files) and code that makes it possible to pull out of this set of vector forms (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts themselves are responsible for the vector shapes, while the operating system and the programs used in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (text editor, browser, etc.), when parsing the code, reads the encoding of the next character and looks for the corresponding vector form in the required font file, which is connected to display this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector form of the character must be in the font used, and the character must be encodable in one byte, as in extended ASCII encodings. That is why a whole bunch of such variants exist; for encoding the characters of the Russian language alone there are several varieties of extended ASCII.

For example, CP866 appeared first; it could use the characters of the Russian alphabet and was an extended version of ASCII.

That is, its upper part completely coincided with the basic version of ASCII (128 Latin characters, numbers and so on), while the lower part of the CP866 table made it possible to encode another 128 characters (Russian letters and all sorts of pseudographics).

In the CP866 table, the codes of the second half start with 8 in the high hexadecimal digit, because codes from 0 to 7 belong to the basic part of ASCII. Thus, the Russian letter "М" in CP866 has the code 8C (the intersection of row 8 and column C in the hexadecimal number system), which can be written in one byte of information; given a suitable font with Russian characters, this letter will appear in the text without problems.

Where did all this pseudographics in CP866 come from? The point is that this encoding for Russian text was developed back in those distant years when graphical operating systems were not as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the design of texts, which is why CP866 and all its peers from the category of extended ASCII versions abound in it.

CP866 was distributed by IBM, but besides it a number of other encodings were developed for Russian characters; KOI8-R, for example, belongs to the same type (extended ASCII).

The principle of its operation remains the same as that of CP866 described a little earlier: each character of text is encoded with one single byte. The first half of the KOI8-R table coincides completely with basic ASCII, while the second half encodes the Russian letters.

Among the features of the KOI8-R encoding, it can be noted that the Russian letters in its table do not come in alphabetical order, as they do, for example, in CP866.

If you look at the basic part of the table (which is included in all extended encodings), you will notice that in KOI8-R the Russian letters are located in the same cells of the table as the corresponding letters of the Latin alphabet from the first part of the table. This was done for the convenience of switching between Russian and Latin characters by discarding just one bit (two to the seventh power, or 128).
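That design decision is easy to demonstrate in Python (a sketch; the sample word is mine): stripping the eighth bit from KOI8-R bytes leaves readable, if case-flipped, Latin text.

    # KOI8-R is laid out so that dropping bit 7 (value 128) leaves a similar Latin letter.
    data = "привет".encode("koi8_r")
    print(bytes(b & 0x7F for b in data))  # b'PRIWET': still legible without the 8th bit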

Windows 1251 - the modern version of ASCII, and why krakozyabry creep out

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity and the need for pseudographics in them gradually disappeared. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one character of text is encoded with just one byte of information), but without the pseudographic symbols.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the version with Russian-language support. An example of this is Windows 1251.

It differed favorably from the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the previously missing symbols of Russian typography (except for the accent mark), as well as symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.).

Due to such an abundance of Russian-language encodings, font manufacturers and software makers constantly had headaches, while you and I, dear readers, often got those same notorious krakozyabry when the encoding used for a text was confused with another.

Very often they crept out when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem fundamentally; users often resorted to transliteration for correspondence to avoid the notorious krakozyabry that came with Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the krakozyabry that appear instead of Russian text are the result of the incorrect use of an encoding for this language, one that does not match the encoding in which the text message was originally encoded.

For example, if you try to display characters encoded with CP866 using the Windows 1251 code table, you will get that same gibberish (a meaningless set of characters) completely replacing the text of the message.
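This exact failure is easy to reproduce in Python (a sketch; the sample word is my own):

    # Bytes written as CP866 but read as Windows-1251 produce classic krakozyabry.
    data = "Привет".encode("cp866")
    print(data.decode("cp1251"))  # ЏаЁўҐв: a meaningless set of characters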

A similar situation very often arises on forums and blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in the wrong text editor, which adds characters invisible to the naked eye into the code.

In the end, many people got tired of this situation with a multitude of encodings and constantly creeping krakozyabry, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - the universal encodings UTF-8, UTF-16 and UTF-32

The thousands of characters of the Southeast Asian language group could not possibly be described in the single byte of information allocated for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the collaboration of many IT industry leaders (those who produce software, who design hardware, who create fonts), all of whom were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character: 32 bits equal the 4 bytes of information needed to encode one single character in the new universal UTF encoding.

As a result, the same file with text encoded in the extended version of ASCII and in UTF-32 will, in the latter case, be four times larger (weigh four times as much). This is bad, but in return we get the ability to encode a number of characters equal to two to the thirty-second power (billions of characters, covering any realistically required value with a colossal reserve).

But for many countries with languages of the European group, such a huge number of characters in the encoding was simply not needed; yet with UTF-32 they would have received, for nothing, a fourfold increase in the weight of text documents, and consequently an increase in the volume of Internet traffic and stored data. This is a lot, and no one could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared, which turned out to be so successful that it was adopted by default as the base space for all the characters we use. It uses two bytes to encode one character. Let's see how this looks.

In the Windows operating system you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the "Advanced options", you will be able to see, for each font separately, the entire range of characters included in it.

By clicking on any of them, by the way, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits.

How many characters can be encoded in UTF-16 with 16 bits? 65,536 (two to the power of sixteen), and this is the number that was adopted as the base space in Unicode. In addition, there are ways to encode about a million more characters with it (via so-called surrogate pairs); the standard limited the extended space to just over a million characters of text.
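Python makes the difference visible (the emoji example is my own choice):

    # Inside the 65,536-character base space: one 16-bit unit (2 bytes).
    print("A".encode("utf-16-be").hex())   # 0041
    # Outside it, UTF-16 spends two units (4 bytes): a surrogate pair.
    print("😀".encode("utf-16-be").hex())  # d83dde00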

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, for example, programs only in English, because after the transition from the extended version of ASCII to UTF-16 the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely to satisfy everyone and everything that the Unicode consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in the name, it really does have variable length: each character of text can be encoded into a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because nothing beyond four bytes of code is needed even theoretically. All Latin characters in it are encoded into one byte, just as in the good old ASCII.

What is noteworthy is that in the case of encoding only the Latin alphabet, even programs that do not understand Unicode will still read what is encoded in UTF-8. That is, the basic part of ASCII simply carried over into this creation of the Unicode consortium.
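This backward compatibility can be verified directly (a small sketch of mine):

    # For pure Latin text, the UTF-8 bytes are identical to plain ASCII bytes.
    text = "Hello, ASCII!"
    print(text.encode("utf-8") == text.encode("ascii"))  # True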

Cyrillic characters in UTF-8 are encoded in two bytes and, for example, Georgian characters in three. The Unicode Consortium, having created UTF-16 and UTF-8, solved the main problem: fonts now have a single code space. Their manufacturers can now simply fill it with vector forms of text characters according to their strengths and capabilities.

In the "Character Map" mentioned above you can see that different fonts support different numbers of characters. Some Unicode-rich fonts can be quite heavy. But now they differ not in being created for different encodings, but in how completely the font manufacturer has filled the single code space with particular vector forms.

Krakozyabry instead of Russian letters - how to fix it

Let's now see how krakozyabry appear instead of text or, in other words, how the correct encoding for Russian text is selected. Actually, it is set in the program in which you create or edit this very text, or code using text fragments.

For editing and creating text files I personally use what is, in my opinion, a very good editor: Notepad++. It can highlight the syntax of hundreds of programming and markup languages, and it can be extended with plugins. Read a detailed review of this wonderful program at the link provided.

In the top menu of Notepad++ there is an "Encodings" item, where you can convert an existing option into the one used by default on your site.

In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, you should choose the option UTF-8 without BOM to avoid the appearance of krakozyabry. But what is this BOM prefix?

The fact is that when the UTF-16 encoding was being developed, it was decided, for some reason, to attach to it the ability to write a character's code both in direct byte order (for example, 0A15) and in reverse (150A). And so that programs would understand in exactly which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented; it is expressed by adding extra bytes to the very beginning of a document.

In the UTF-8 encoding, the Unicode consortium did not provide for any BOM, so adding a signature (those notorious extra three bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always select the option without BOM (without signature). This way you protect yourself in advance from creeping krakozyabry.

What is noteworthy is that some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM), for example the notorious Windows Notepad. It saves a document in UTF-8 but still adds the signature (three extra bytes) to the beginning of it. Moreover, these bytes are always the same: read the code in direct sequence. But on servers, this little thing can cause a problem: krakozyabry will come out.
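The signature itself is easy to see from Python, where the "with BOM" flavor of UTF-8 is exposed as the utf-8-sig codec (that name is Python's, not Notepad's):

    # UTF-8 with a BOM simply prepends the three bytes EF BB BF.
    print("hi".encode("utf-8-sig"))  # b'\xef\xbb\xbfhi'
    print("hi".encode("utf-8"))      # b'hi': no signature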

Therefore, never use plain Windows Notepad to edit documents on your site if you don't want krakozyabry to appear. The best and simplest option, I think, is the already mentioned Notepad++ editor, which has practically no disadvantages and consists only of advantages.

In Notepad++, when you select an encoding, you will also have the option to convert text to the UCS-2 encoding, which is very close in nature to the Unicode standard. Notepad++ also lets you encode text in ANSI, which, for the Russian language, means the Windows 1251 already described just above. Where does this information come from?

It is recorded in the registry of your Windows operating system: which encoding to choose in the case of ANSI and which in the case of OEM (for the Russian language this will be CP866). If you set another default language on your computer, these encodings will be replaced with the corresponding ones from the ANSI or OEM category for that language.

After you save a document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see its name in the lower right corner of the editor.

To avoid krakozyabry, in addition to the actions described above, it is useful to write information about this very encoding into the source-code header of every page of the site, so that there is no confusion on the server or the local host.

In general, all hypertext markup languages besides HTML use a special XML declaration of the form <?xml version="1.0" encoding="utf-8"?>, which indicates the text encoding.

Before parsing the code, the browser thus knows which version is being used and exactly how to interpret the character codes of that language. But, notably, if you save the document in the default Unicode, this XML declaration can be omitted (the encoding will be considered UTF-8 if there is no BOM, or UTF-16 if there is one).

In the case of an HTML document, the Meta element, written between the opening and closing Head tags, is used to indicate the encoding:

<meta charset="utf-8">

This entry differs quite a bit from the one adopted earlier, but it fully complies with the HTML 5 standard that is gradually being introduced, and it is understood completely correctly by any browser currently in use.

In theory, the Meta element indicating the HTML document's encoding is better placed as high as possible in the document header, so that by the time the first character not from basic ASCII is encountered in the text (such characters are always read correctly in any variation), the browser already has the information on how to interpret the codes of these characters.

Good luck to you! See you soon on the pages of the blog.



As you know, a computer stores information in binary form, representing it as a sequence of ones and zeros. To translate information into a form convenient for human perception, each unique sequence of numbers is replaced by its corresponding symbol when displayed.

One of the systems for correlating binary codes with printable and control characters is ASCII.

At today's level of computer technology development, the user is not required to know the code of each specific character. However, a general understanding of how encoding works is extremely useful, and for some categories of specialists even necessary.

Creating ASCII

The encoding was originally developed in 1963 and then updated twice over the course of 25 years.

In the original version, the ASCII character table included 128 characters; later an extended version appeared, where the first 128 characters were kept and previously missing characters were assigned codes with the eighth bit involved.

For many years, this encoding was the most popular in the world. In 2006, Latin 1252 took the leading position, and from the end of 2007 to the present, Unicode has firmly held it.

Computer representation of ASCII

Each ASCII character has its own code consisting of 8 binary digits, zeros or ones. The minimum number in this representation is zero (eight binary zeros), which is the code of the first element in the table.

Two codes in the table were reserved for switching between standard US-ASCII and its national variant.

After ASCII began to include not 128 but 256 characters, an encoding variant became widespread in which the original version of the table was kept in the first 128 codes, with the 8th bit zero. National written characters were stored in the upper half of the table (positions 128-255).

The user does not need to know the ASCII character codes directly. A software developer usually only needs to know the element number in the table to calculate its code using the binary system if necessary.

Russian language

After encodings for the Scandinavian languages, Chinese, Korean, Greek, etc. were developed in the early 70s, the Soviet Union also took up the creation of its own version. Soon an 8-bit encoding called KOI8 was developed, preserving the first 128 ASCII character codes and allocating the same number of positions for the letters of the national alphabet and additional characters.

Before the introduction of Unicode, KOI8 dominated the Russian segment of the Internet. There were encoding variants for both the Russian and the Ukrainian alphabet.

ASCII problems

Since the number of elements even in the extended table did not exceed 256, it was impossible to accommodate several different scripts in one encoding. In the 90s, the "krakozyabry" problem appeared on the Runet, when texts typed in Russian ASCII characters were displayed incorrectly.

The problem was the mismatch between the codes of the various ASCII variants. Recall that various characters could occupy positions 128-255, and when one Cyrillic encoding was swapped for another, all the letters of a text were replaced by the characters with the identical numbers in the other variant of the encoding.

Current state

With the advent of Unicode, the popularity of ASCII began to decline sharply.

The reason lies in the fact that the new encoding made it possible to accommodate the characters of almost all written languages. Moreover, the first 128 ASCII characters correspond to the same characters in Unicode.

In 2000, ASCII was the most popular encoding on the Internet and was used on 60% of web pages indexed by Google. By 2012, the share of such pages had dropped to 17%, and Unicode (UTF-8) took the place of the most popular encoding.

So ASCII is an important part of the history of information technology, but its future use seems unpromising.






