How many characters are in the Unicode table. Using Unicode with @font-face icons


Unicode is an international character encoding standard that allows text to be displayed consistently on any computer in the world, regardless of the system language it uses.

Basics

To understand what a Unicode character table is for, let's first understand the mechanism for displaying text on a monitor screen. A computer, as we know, processes all information in digital form, and must display it graphically for correct human perception. Thus, in order for us to read this text, we need to solve at least two problems:

  • Encode printed characters into digital form.
  • Give the operating system a way to match those digital codes with vector glyphs, in other words, to find the correct letters.

First encodings

American ASCII is considered to be the ancestor of all encodings. It described the Latin alphabet with punctuation marks and Arabic numerals used in English. It was the 128 characters used in it that became the basis for subsequent developments - even the modern Unicode character table uses them. Letters of the Latin alphabet have since occupied the first positions in any encoding.

ASCII itself defines only 128 characters, but a single byte can store 256 values, so the remaining 128 positions began to be used around the world to create national standards. For example, in Russia, CP866 and KOI8-R were created on this basis. Such variations were called extended versions of ASCII.

Code pages and "krakozyabry"

The further development of technology and the emergence of graphical interfaces led to the so-called ANSI encodings, named after the American National Standards Institute. Russian users, especially experienced ones, know the local variant as Windows 1251. The concept of a "code page" was first widely used here. It was with the help of code pages, which contained characters from national alphabets other than Latin, that "mutual understanding" was established between computers used in different countries.

However, the presence of many different encodings for one language began to cause problems. The so-called krakozyabry appeared. They arose from a mismatch between the code page in which the information was created and the code page used by default on the end user's computer.

As an example, we can cite the Cyrillic encodings CP866 and KOI8-R mentioned above. The letters in them differed in their code positions and placement principles: in the first they were placed in alphabetical order, in the second in an order that looks arbitrary (it actually pairs each Cyrillic letter with a Latin counterpart). You can imagine what appeared before the eyes of a user who tried to open such a text without the required code page, or when the computer interpreted it incorrectly.

Creation of Unicode

The spread of the Internet and related technologies, such as e-mail, meant that in the end the situation with distorted texts stopped suiting everyone. Leading IT companies formed the Unicode Consortium in 1991. The fixed-width encoding it introduced, known as UTF-32, uses 32 bits per character and can therefore address over four billion unique values. It was an important step on the way to texts that could be read anywhere.

However, this first universal character code table, UTF-32, was not widely adopted. The main reason was the redundancy of the stored information: it was quickly calculated that for countries using the Latin alphabet, text encoded with the new universal table would take up four times as much space as with the extended ASCII table.

Development of Unicode

The next Unicode character table, UTF-16, mitigated this problem. Characters in it are encoded with half as many bits, which also sharply reduces the number of possible combinations: instead of billions of characters, it allows only 65,536 to be stored directly. Nevertheless, it turned out to be so successful that, by the Consortium's decision, this number was defined as the base code space for Unicode standard characters.

Despite this success, UTF-16 did not suit everyone, since the volume of stored and transmitted information was still doubled compared with single-byte encodings. The universal solution turned out to be UTF-8, a variable-length Unicode encoding, and it can rightly be called a breakthrough in this area.

Thus, with the introduction of the last two standards, the Unicode character table solved the problem of a single code space for all fonts currently in use.

Unicode for Russian language

Thanks to the variable length of the code used to represent characters, the Latin alphabet is encoded in Unicode exactly as in its ancestor ASCII, that is, with one byte. For other alphabets the picture looks different: characters of the Georgian alphabet, for example, take three bytes each, and characters of the Cyrillic alphabet take two. All of this is possible within the UTF-8 form of the Unicode standard (character table). The Russian language, or rather the Cyrillic script, occupies 448 positions in the overall code space, divided into five blocks.
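To make these byte counts concrete, here is a minimal Python sketch; the sample characters are our own illustration, not taken from the article:

    # Byte lengths of UTF-8 for characters from different scripts (Python 3).
    samples = {"Latin A": "A", "Cyrillic Д": "Д", "Georgian ა": "ა"}
    for name, ch in samples.items():
        encoded = ch.encode("utf-8")
        print(f"{name}: U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
    # Latin A:    U+0041 -> 41        (1 byte)
    # Cyrillic Д: U+0414 -> d0 94     (2 bytes)
    # Georgian ა: U+10D0 -> e1 83 90  (3 bytes)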

These five blocks include the basic Cyrillic and Church Slavonic alphabets, as well as additional letters for other languages that use Cyrillic script. A number of positions are allocated to old forms of Cyrillic letters, and 22 positions of the total are still free.

Current version of Unicode

Having solved its primary task, which was to standardize fonts and create a single code space for them, the Consortium did not stop its work. Unicode is constantly evolving and expanding. The version current at the time this was written, 9.0, was released in 2016; it added six new scripts and expanded the list of standardized emoji.

It should be said that, to simplify research, even so-called dead languages are added to Unicode. They received this name because there is no one for whom they are a native tongue; the group also includes languages that have reached our time only in the form of written monuments.

In principle, anyone can apply to have characters added to a new Unicode specification. True, to do this you will have to fill out a fair amount of paperwork and spend a lot of time. A living example is the story of programmer Terence Eden. In 2013 he filed an application to include in the specification the symbols used to mark computer power-control buttons. They had been used in technical documentation since the mid-1970s, but were not part of Unicode until the 9.0 specification appeared.

The character table

Every computer, regardless of the operating system used, includes a Unicode character table. How do you use these tables, where do you find them, and how can they be useful to the average user?

In Windows, the character table (the Character Map utility) is found under Accessories - System Tools. In the Linux family of operating systems it can usually be found in the Accessories section, and in macOS it lives in the keyboard settings (the character viewer). The main purpose of this table is to enter characters that are not on the keyboard into text documents.

Such tables have a very wide range of applications: from entering technical symbols and national currency signs to writing instructions for the practical use of Tarot cards.

Finally

Unicode is used everywhere and entered our lives along with the development of the Internet and mobile technologies. Thanks to it, international communication has been significantly simplified. We can say that the introduction of Unicode is an illustrative, yet outwardly invisible, example of using technology for the common good of all humanity.

Today we will talk about where krakozyabry come from on websites and in programs, what text encodings exist and which ones should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8. Table of contents:

  • Extended versions of ASCII - the CP866 and KOI8-R encodings
  • Windows 1251 - an ASCII variation and why krakozyabry come out
To some this information may seem unnecessary, but you would be surprised how many questions I receive specifically about creeping krakozyabry (an unreadable set of characters). Now I will have the opportunity to refer everyone to the text of this article and to find my own mistakes. Well, get ready to absorb the information and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings occurred simultaneously with the formation of the IT industry, and during this time they managed to undergo quite a lot of changes. Historically, it all started with EBCDIC (which sounds rather dissonant in Russian pronunciation), which made it possible to encode letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters. Still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, which in Russian is usually pronounced "aski"). It describes the first 128 characters most frequently used by English-speaking users: Latin letters, Arabic numerals and punctuation marks. These 128 characters also include some service characters like brackets, hash marks, asterisks, etc. In fact, you can see them yourself:
It is these 128 characters from the original version of ASCII that became the standard, and in any other encoding you will find them in the same positions and in the same order. But one byte of information can encode not 128 but 256 different values (two to the power of eight equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in these descriptions. First, as you all know, a computer works only with numbers in the binary system, that is, with zeros and ones ("Boolean algebra", if anyone took it at an institute or school). One byte consists of eight bits, each of which represents a power of two, starting from zero and ending with two to the seventh power:
It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a construction. Converting a number from binary to decimal is quite simple: you just add up all the powers of two that have ones above them. For the binary number 11101001 this turns out to be 1 (two to the power of zero) plus 8 (two to the power of three), plus 32 (two to the fifth power), plus 64 (two to the sixth), plus 128 (two to the seventh), which gives 233 in decimal notation. As you can see, everything is very simple.

But if you look closely at a table of ASCII characters, you will see that they are given in hexadecimal form. For example, the asterisk corresponds in ASCII to the hexadecimal number 2A. You probably know that in the hexadecimal number system, in addition to Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen) are used. Converting a binary number to hexadecimal uses a simple and obvious method: each byte is divided into two halves of four bits. Each half-byte (nibble) can encode only sixteen values (two to the fourth power), so it can easily be written as a single hexadecimal digit; within each half the powers are counted again starting from zero. As a result of this simple calculation, our example 11101001 becomes the number E9. I hope the course of my reasoning and the solution of this little puzzle were clear to you. Well, now let's continue, in fact, talking about text encodings.
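If you want to check this arithmetic yourself, a short Python sketch (our own addition, not part of the original article) performs the same conversions:

    bits = "11101001"
    # Sum the powers of two that have a one above them: 128 + 64 + 32 + 8 + 1 = 233.
    decimal = sum(2 ** i for i, bit in enumerate(reversed(bits)) if bit == "1")
    print(decimal)        # 233
    print(int(bits, 2))   # 233, the same conversion done by the standard library
    print(hex(decimal))   # 0xe9 - each half-byte (nibble) becomes one hexadecimal digit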

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8). Initially it contained only 128 characters: the Latin alphabet, Arabic numerals and a few others. In the extended versions it became possible to use all 256 values that can be encoded in one byte of information, i.e. it became possible to add the letters of your own language to ASCII.

Here we need to digress once more to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (glyph outlines) for all kinds of characters, which live in the font files installed on your computer, and a code that makes it possible to pull out of that set of vector shapes exactly the character that needs to be inserted in the right place. It is clear that the fonts are responsible for the vector shapes, while the operating system and the programs it runs are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text. The program that displays this text on the screen (a text editor, a browser, etc.), while parsing the code, reads the encoding of the next character and looks for the corresponding vector shape in the font file connected for displaying this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must exist in the font used, and the character must be representable in an extended ASCII encoding in one byte. That is why a whole bunch of such variants exist. Just for encoding Russian characters there are several varieties of extended ASCII. For example, CP866 appeared first; it could use characters of the Russian alphabet and was an extended version of ASCII. That is, its upper part completely coincided with the basic version of ASCII (128 Latin characters, numerals and so on), while the lower half of the CP866 table allowed another 128 characters to be encoded (Russian letters and all sorts of pseudographics):
In the second half of the table the codes start at 80, because the values from 00 to 7F belong to the basic part of ASCII. Thus the Russian letter "М" in CP866 has the code 8C (hexadecimal), which can be written in one byte of information, and if a suitable font with Russian characters is available, this letter will appear in the text without problems.

Where did such an amount of pseudographics in CP866 come from? The whole point is that this encoding for Russian text was developed back in those shaggy years when graphical operating systems were not as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the layout of texts, which is why CP866 and all its peers from the category of extended ASCII versions abound in them. CP866 was distributed by IBM, but in addition to it a number of other encodings were developed for Russian characters; KOI8-R, for example, belongs to the same type (extended ASCII):
The principle of its operation remains the same as that of CP866 described a little earlier: each character of text is encoded with one single byte, and the first half of the KOI8-R table completely matches basic ASCII. Among the peculiarities of the KOI8-R encoding, note that the Russian letters in its table are not in alphabetical order, as they are in CP866. Instead, each Russian letter sits in the same cell of the table as the corresponding Latin letter from the first half of the table. This was done for the convenience of switching from Russian to Latin characters by discarding just one bit (two to the seventh power, or 128).
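A small Python sketch (our own illustration) shows both points made above: the same letter gets different byte values in CP866 and KOI8-R, and dropping the eighth bit of a KOI8-R byte yields a Latin letter:

    letter = "М"                            # Cyrillic capital letter M
    print(letter.encode("cp866").hex())     # 8c - alphabetical layout starting at 0x80
    print(letter.encode("koi8_r").hex())    # ed - a completely different position
    # KOI8-R was laid out so that clearing the high bit leaves a readable Latin
    # transliteration: 0xED & 0x7F == 0x6D, which is ASCII "m".
    byte = letter.encode("koi8_r")[0]
    print(chr(byte & 0x7F))                 # m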

Windows 1251 - a modern version of ASCII and why krakozyabry come out

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity and the need for pseudographics gradually disappeared. As a result, a whole group of encodings arose that, in essence, were still extended versions of ASCII (one character of text is still encoded with just one byte of information), but without pseudographic symbols. They belonged to the so-called ANSI encodings, named after the American National Standards Institute. In common parlance, the name Cyrillic was also used for the variant with Russian-language support. An example of this is Windows 1251. It differed favorably from the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (except for the accent mark), as well as symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):
Because of such an abundance of Russian-language encodings, font and software manufacturers constantly had headaches, and you and I, dear readers, often got those same notorious krakozyabry when there was confusion about which version was used in a text. They very often appeared when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem fundamentally; users often resorted to transliteration with Latin letters in their correspondence to avoid the notorious gibberish when using Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the krakozyabry that appeared instead of Russian text were the result of using the wrong encoding for this language, one that did not match the encoding in which the text message was originally created. For example, if you try to display characters encoded in CP866 using the Windows 1251 code table, you will get exactly this gibberish (a meaningless set of characters) completely replacing the text of the message.

A similar situation very often arises when creating and setting up websites, forums or blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in the wrong text editor, which adds garbage to the code that is not visible to the naked eye. In the end, many people got tired of this situation with numerous encodings and constantly creeping krakozyabry, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.
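The effect is easy to reproduce; the following hedged Python sketch writes text in one single-byte encoding and reads it back in another:

    original = "Привет"
    raw = original.encode("cp866")           # bytes as an old DOS program would store them
    garbled = raw.decode("windows-1251")     # a Windows-1251 viewer misreads the same bytes
    print(garbled)                           # prints something like ЏаЁўҐв - a meaningless set of characters
    print(raw.decode("cp866"))               # Привет - the correct code page restores the text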

Unicode - the universal encodings UTF-8, UTF-16 and UTF-32

The thousands of characters of the Southeast Asian language group could not possibly be described in the single byte of information allocated for characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the cooperation of many IT industry leaders (those who produce software, encode hardware and create fonts), who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character: 32 bits equal the 4 bytes of information needed to encode one single character in the new universal UTF encoding. As a result, the same file with text encoded in an extended version of ASCII and in UTF-32 will, in the latter case, be four times larger. This is bad, but now there was the possibility of encoding a number of characters equal to two to the thirty-second power (billions of characters, covering any realistically needed amount with a colossal reserve).

But for many countries with languages of the European group this huge number of characters was not needed at all, while with UTF-32 they would get, for no reason, a fourfold increase in the weight of text documents and, as a result, an increase in the volume of Internet traffic and the amount of stored data. This is a lot, and no one could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared, which turned out to be so successful that its 16-bit range was adopted as the base space for all the characters we use. It uses two bytes to encode one character. Let's see how this looks. In the Windows operating system you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Table". A table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the advanced options, you can see, for each font separately, the entire range of characters included in it. By clicking on any of them you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits.

How many characters can be encoded in UTF-16 with 16 bits? 65,536 (two to the power of sixteen), and it is this number that was adopted as the base space in Unicode. In addition, with the help of surrogate pairs it can encode roughly a million more characters, and the total Unicode space was limited to just over a million code positions.

But even this fairly successful version of the Unicode encoding did not satisfy those who wrote, say, programs only in English, because after the transition from the extended version of ASCII to UTF-16 the weight of their documents doubled (one byte per character in ASCII and two bytes for the same character in UTF-16). Precisely to satisfy everyone and everything, the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in the name, it really has a variable length: each character of text can be encoded into a sequence from one to six bytes long. In practice UTF-8 uses only the range from one to four bytes, because nothing beyond four bytes of code is needed. All Latin characters in it are encoded in one byte, just as in the good old ASCII.
What is noteworthy is that, in the case of encoding only the Latin alphabet, even programs that do not understand Unicode will still read text encoded in UTF-8. That is, the basic part of ASCII was simply carried over into this creation of the Unicode Consortium. Cyrillic characters in UTF-8 are encoded in two bytes and, for example, Georgian ones in three bytes. With the creation of UTF-16 and UTF-8 the Unicode Consortium solved the main problem: now fonts have a single code space. Their manufacturers can only fill it with vector shapes of text characters according to their strengths and capabilities. Different fonts support different numbers of characters, which you can see in the "Character Table" mentioned above, and some Unicode-rich fonts can be quite heavy. But now fonts differ not in being created for different encodings, but in how completely the font manufacturer has filled the single code space with particular vector shapes.
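For comparison, here is a small Python sketch (the example string is our own) showing how much space the same text takes in the three encodings; the -le variants are used so that no byte-order mark is added to the counts:

    text = "Hello, мир"
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, len(text.encode(enc)), "bytes")
    # utf-8:     13 bytes (ASCII characters cost 1 byte, Cyrillic letters 2 bytes each)
    # utf-16-le: 20 bytes (every character here fits in one 16-bit code unit)
    # utf-32-le: 40 bytes (a fixed 4 bytes per character)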

Krakozyabry instead of Russian letters - how to fix it

Let's now see how krakozyabry appear instead of text or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit that very text, or code using text fragments. To edit and create text files I personally use what is, in my opinion, a very good HTML and PHP editor, Notepad++. It can also highlight the syntax of hundreds of other programming and markup languages, and it can be extended with plugins. Read a detailed review of this wonderful program at the link provided. In the top menu of Notepad++ there is an item "Encodings", where you can convert an existing option to the one used by default on your site:
In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, you should select the option UTF-8 without BOM to avoid the appearance of krakozyabry. What is the BOM prefix? When the UTF-16 encoding was being developed, it was decided to allow a character's code to be written both in direct byte order (for example, 0A15) and in reverse (150A). For programs to understand in exactly which order to read the codes, the BOM (Byte Order Mark, in other words a signature) was invented: a few extra bytes placed at the very beginning of the document (in UTF-16 the signature is the two bytes FE FF or FF FE; the UTF-8 signature is the three bytes EF BB BF).

In UTF-8 a byte-order mark is not actually needed, so adding the signature (those notorious extra three bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we should always select the option without BOM (without signature). This way you protect yourself in advance from creeping krakozyabry. Notably, some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM), for example the notorious Windows Notepad: it saves the document in UTF-8 but still adds the signature (three extra bytes, always the same ones) to the beginning of it. On servers this little thing can create a problem, and krakozyabry come out. Therefore, under no circumstances use regular Windows Notepad to edit documents on your site if you don't want krakozyabry to appear. I consider the already mentioned Notepad++ editor the best and simplest option; it has practically no disadvantages and consists only of advantages.

In Notepad++, when choosing an encoding, you will also see an option to convert text to UCS-2 encoding, which is very close in nature to the Unicode standard. In Notepad it is also possible to encode text in ANSI, which for the Russian language means Windows 1251, already described just above. Where does this information come from? It is written in the registry of your Windows operating system: which encoding to choose in the case of ANSI and which in the case of OEM (for the Russian language it will be CP866). If you set another default language on your computer, these encodings will be replaced with the corresponding ANSI or OEM encodings for that language. After you save the document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see its name in the lower right corner of the editor.

To avoid krakozyabry, in addition to the actions described above, it is useful to write information about the encoding in the header of the source code of every page of the site, so that there is no confusion on the server or the local host. In general, XML-based markup languages (unlike HTML) use a special xml declaration that specifies the text encoding: <?xml version="1.0" encoding="windows-1251"?>. Before parsing the code, the browser then knows which version is being used and exactly how to interpret the character codes of that language. Notably, if you save the document in the default Unicode, the xml declaration can be omitted (the encoding will be considered UTF-8 if there is no BOM, or UTF-16 if there is a BOM).
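A hedged Python sketch of the signature discussion above; the utf-8-sig codec is the standard library's name for "UTF-8 with BOM":

    text = "Привет"
    with_bom = text.encode("utf-8-sig")    # prepends the signature
    without_bom = text.encode("utf-8")
    print(with_bom[:3].hex(" "))           # ef bb bf - the three extra bytes at the start
    print(with_bom[3:] == without_bom)     # True - the rest of the data is identical
    print(with_bom.decode("utf-8-sig"))    # Привет - decoding with utf-8-sig strips the signature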
In an HTML document the encoding is indicated with a Meta element written between the opening and closing Head tags: <head> ... <meta charset="utf-8"> ... </head>. This notation differs quite a bit from the one adopted in the HTML 4.01 standard, but it fully corresponds to the HTML 5 standard that is being gradually introduced, and it will be understood correctly by any browser currently in use. In theory, the Meta element indicating the HTML document's encoding is best placed as high as possible in the document header, so that by the time the first character outside basic ASCII (which is always read correctly in any variation) is encountered in the text, the browser already has the information on how to interpret the codes of those characters.


Unicode: UTF-8, UTF-16, UTF-32.

Unicode is a set of graphic characters and a method of encoding them for computer processing of text data.

Unicode not only assigns a unique code to each character, but also defines various characteristics of that character, for example:

    character type (capital letter, lowercase letter, number, punctuation mark, etc.);

    character attributes (display from left to right or right to left, space, line break, etc.);

    the corresponding uppercase or lowercase letter (for lowercase and uppercase letters, respectively);

    the corresponding numeric value (for numeric characters).

    UTF standards (UTF is an abbreviation for Unicode Transformation Format) for representing characters:

UTF-16: Windows Vista uses the UTF-16 encoding to represent all Unicode characters. In UTF-16, characters are represented by two bytes (16 bits). This encoding is used in Windows because 16-bit values can represent the characters that make up the alphabets of most languages, which allows programs to process strings and calculate their length faster. However, 16 bits are not enough to represent the alphabet characters of some languages. For such cases, UTF-16 supports "surrogate" pairs, which allow characters to be encoded in 32 bits (4 bytes). However, there are few applications that have to deal with characters from such languages, so UTF-16 is a good compromise between saving memory and ease of programming. Note that the .NET Framework encodes all characters using UTF-16, so using UTF-16 in Windows applications improves performance and reduces memory consumption when passing strings between native and managed code.

UTF-8: In UTF-8 encoding, different characters can be represented by 1, 2, 3 or 4 bytes. Characters with values less than 0x0080 are compressed to 1 byte, which is very convenient for the characters used in the United States. Characters corresponding to values in the range 0x0080-0x07FF are converted to 2-byte values, which works well for European and Middle Eastern alphabets. Characters with larger values are converted to 3-byte values, convenient when working with Central Asian languages. Finally, "surrogate" pairs are written in a 4-byte format. UTF-8 is an extremely popular encoding. However, it is less efficient than UTF-16 if characters with values 0x0800 and higher are used frequently.

UTF-32: In UTF-32, all characters are represented by 4 bytes. This encoding is convenient for writing simple algorithms that enumerate the characters of any language without having to handle characters represented by different numbers of bytes. For example, with UTF-32 you can forget about "surrogates", since any character in this encoding is represented by 4 bytes. Clearly, in terms of memory usage UTF-32 is far from ideal. Therefore this encoding is rarely used to transmit strings over a network or to save them to files. As a rule, UTF-32 is used as an internal format for representing data in a program.
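A short Python sketch (our own illustration) of the difference between a BMP character and a character that needs a surrogate pair:

    bmp_char = "Я"        # U+042F, inside the Basic Multilingual Plane
    astral_char = "😀"     # U+1F600, outside the BMP
    for ch in (bmp_char, astral_char):
        u16 = len(ch.encode("utf-16-le"))
        u32 = len(ch.encode("utf-32-le"))
        print(f"U+{ord(ch):04X}: UTF-16 {u16} bytes, UTF-32 {u32} bytes")
    # U+042F:  UTF-16 2 bytes,                    UTF-32 4 bytes
    # U+1F600: UTF-16 4 bytes (a surrogate pair), UTF-32 4 bytes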

UTF-8

In the near future, a special Unicode (and ISO 10646) format called UTF-8 is likely to play an increasingly important role. This "derived" encoding uses strings of bytes of varying length (from one to six) to write characters, which are converted into Unicode codes by a simple algorithm, with shorter strings corresponding to more common characters. The main advantage of this format is compatibility with ASCII not only in code values but also in the number of bits per character, since one byte is enough to encode any of the first 128 characters in UTF-8 (although, for example, Cyrillic letters already need two bytes).

The UTF-8 format was invented on September 2, 1992 by Ken Thompson and Rob Pike and implemented in Plan 9. The UTF-8 standard is now formalized in RFC 3629 and ISO/IEC 10646 Annex D.

For a Web designer, this encoding is of particular importance because it has been declared the "standard document encoding" in HTML starting from version 4.

Text consisting only of characters with numbers less than 128 turns into plain ASCII text when written in UTF-8. Conversely, in UTF-8 text any byte with a value less than 128 represents the ASCII character with the same code. The remaining Unicode characters are represented by sequences of 2 to 6 bytes in length (in practice only up to 4 bytes, since the use of codes beyond 2^21 is not planned), in which the first byte always has the form 11xxxxxx and the rest 10xxxxxx.
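The packing rule can be written out as a short function; this is our own sketch of the algorithm, not code from the original text:

    def utf8_encode(code_point: int) -> bytes:
        """Pack one code point into UTF-8 lead and continuation bytes."""
        if code_point < 0x80:                       # 0xxxxxxx - plain ASCII
            return bytes([code_point])
        if code_point < 0x800:                      # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
        if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (code_point >> 12),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        return bytes([0xF0 | (code_point >> 18),    # 11110xxx plus three continuation bytes
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])

    for ch in "aп€😀":
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex(' ')}")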

Simply put, in UTF-8, Latin characters, punctuation, and ASCII control characters are written in US-ASCII codes, and all other characters are encoded using multiple octets with the most significant bit 1. This has two effects.

    Even if the program does not recognize Unicode, Latin letters, Arabic numerals and punctuation marks will be displayed correctly.

    If Latin letters and simple punctuation marks (including spaces) occupy a significant amount of text, UTF-8 provides a gain in volume compared to UTF-16.

    At first glance, it may seem that UTF-16 is more convenient, since most characters are encoded in exactly two bytes. However, this is offset by the need to support surrogate pairs, which are often forgotten when using UTF-16; many implementations only support the UCS-2 subset.
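The surrogate-pair arithmetic that is so easy to forget fits in a few lines; the following is a hedged Python sketch:

    code_point = 0x1F600                       # 😀, outside the BMP
    value = code_point - 0x10000               # 20 bits remain
    high = 0xD800 + (value >> 10)              # high (lead) surrogate
    low = 0xDC00 + (value & 0x3FF)             # low (trail) surrogate
    print(f"{high:04X} {low:04X}")             # D83D DE00
    print("😀".encode("utf-16-be").hex(" "))   # d8 3d de 00 - the same pair in the byte stream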

Unicode is a very large and complex world, because the standard allows you to represent and work on a computer with all the main scripts of the world. Some writing systems have existed for more than a thousand years, with many of them developing almost independently of each other in different parts of the world. People have come up with so many things and they are often so different from each other that combining them all into a single standard was an extremely difficult and ambitious task.

To truly understand Unicode, you need to have at least a superficial understanding of the features of all the scripts that the standard allows you to work with. But is this what every developer needs? We'll say no. To use Unicode for most everyday tasks, it is enough to know a reasonable minimum of knowledge, and then delve into the standard as needed.

In this article we will talk about the basic principles of Unicode and highlight those important practical issues that developers will certainly encounter in their daily work.

Why was Unicode needed?

Before the advent of Unicode, single-byte encodings were almost universally used, in which the boundary between the characters themselves, their representation in computer memory and display on the screen was quite arbitrary. If you worked with one or another national language, then the corresponding encoding fonts were installed in your system, which allowed you to draw bytes from the disk on the screen in such a way that they made sense to the user.

If you were printing a text file on a printer and saw a bunch of incomprehensible gibberish on the paper page, this meant that the appropriate fonts were not loaded into the printing device and it was not interpreting the bytes the way you would like.

This approach in general and single-byte encodings in particular had a number of significant disadvantages:

  1. It was possible to simultaneously work with only 256 characters, with the first 128 being reserved for Latin and control characters, and in the second half, in addition to the symbols of the national alphabet, it was necessary to find a place for pseudographic symbols (╔ ╗).
  2. Fonts were tied to a specific encoding.
  3. Each encoding represented its own set of characters, and conversion from one to another was possible only with partial losses, when missing characters were replaced with graphically similar ones.
  4. Transferring files between devices running different operating systems was difficult. You either had to have a converter program or carry additional fonts along with the file. The existence of the Internet as we know it was impossible.
  5. There are non-alphabetic writing systems in the world (hieroglyphic writing), which in principle cannot be represented in a single-byte encoding.

Basic principles of Unicode

We all understand perfectly well that a computer does not know about any ideal entities, but operates with bits and bytes. But computer systems are still created by people, not machines, and for you and me, sometimes it is more convenient to operate with speculative concepts, and then move from the abstract to the concrete.

Important! One of the central principles in the Unicode philosophy is the clear distinction between characters, their representation in a computer, and their display on an output device.

The concept of an abstract Unicode character is introduced, existing solely in the form of a speculative concept and agreement between people, enshrined in the standard. Each Unicode character is associated with a non-negative integer called its code point.

For example, the Unicode character U+041F is the capital Cyrillic letter P. There are several ways to represent this character in the computer’s memory, as well as several thousand ways to display it on the monitor screen. But at the same time P, it will also be P or U+041F in Africa.

This is the familiar encapsulation or separation of the interface from the implementation - a concept that has proven itself well in programming.

It turns out that, guided by the standard, any text can be encoded as a sequence of Unicode characters

Hello: U+041F U+0440 U+0438 U+0432 U+0435 U+0442

write it down on a piece of paper, pack it in an envelope and send it to any part of the world. If the recipients know about the existence of Unicode, they will perceive the text in exactly the same way as you and I. They will not have the slightest doubt that the penultimate character is the Cyrillic lowercase е (U+0435) and not, say, the Latin small e (U+0065). Notice that we haven't said a word about the byte representation.
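A tiny Python sketch (our own addition) turns that list of code positions back into text:

    code_points = ["U+041F", "U+0440", "U+0438", "U+0432", "U+0435", "U+0442"]
    text = "".join(chr(int(cp[2:], 16)) for cp in code_points)
    print(text)                   # Привет - the Russian word for "Hello"
    print(f"U+{ord('е'):04X}")    # U+0435, the Cyrillic lowercase е from the example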

Although Unicode characters are called symbols, they do not always correspond to a symbol in the traditional naive sense, such as a letter, number, punctuation mark or hieroglyph. (See spoiler for more details.)

Examples of various Unicode characters

There are purely technical Unicode characters, for example:

  • U+0000: null character;
  • U+D800–U+DFFF: the high and low surrogates used for the technical representation of code positions in the range from 10000 to 10FFFF (read: outside the BMP) in the UTF-16 encoding family;
  • etc.
There are formatting marks, for example U+200F: a mark that changes the direction of writing from right to left.

There is a whole cohort of spaces of various widths and purposes:

  • U+0020 (space);
  • U+00A0 (non-breaking space, &nbsp; in HTML);
  • U+2002 (en space);
  • U+2003 (em space);
  • etc.
There are combining diacritical marks - all kinds of strokes, dots, tildes, etc. that change or clarify the meaning of the preceding character and its appearance (a small sketch follows after this list). For example:
  • U+0300 and U+0301: combining grave and acute accents (used, among other things, as stress marks);
  • U+0306: combining breve, as in й;
  • U+0303: combining tilde;
  • etc.
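As promised above, here is a small sketch of how a combining mark works in practice: the letter й can be one precomposed character or a base letter plus the combining breve U+0306; Python's standard unicodedata module is used for normalization:

    import unicodedata

    precomposed = "\u0439"         # й as a single code point
    combined = "\u0438\u0306"      # и followed by COMBINING BREVE
    print(precomposed == combined)                                # False
    print(len(precomposed), len(combined))                        # 1 2
    print(unicodedata.normalize("NFC", combined) == precomposed)  # True - the composed form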
There is even something as exotic as language tags (U+E0001, U+E0020–U+E007E and U+E007F), which are currently in limbo. They were intended as a way to mark certain sections of text as belonging to a particular language variant (say, American or British English), which could influence the details of how the text was displayed.

We will tell you next time what a symbol is, and how a grapheme cluster (read: what is perceived as a single image of a character) differs from a Unicode character and from a code quantum.

Unicode code space

The Unicode code space consists of 1,114,112 code positions ranging from 0 to 10FFFF. Of these, only 128,237 have been assigned values ​​for the ninth version of the standard. Some of the space is reserved for private use and the Unicode Consortium promises never to assign values ​​to positions in these special areas.

For the sake of convenience, the entire space is divided into 17 planes (six of them are currently used). Until recently, it was commonly said that most likely you would only encounter the Basic Multilingual Plane (BMP), which includes Unicode characters from U+0000 to U+FFFF. (Looking ahead a little: characters from BMP are represented in UTF-16 by two bytes, not four). In 2016, this thesis is already in doubt. For example, popular Emoji characters may well appear in a user message and you need to be able to process them correctly.

Encodings

If we want to send text over the Internet, we will need to encode a sequence of Unicode characters as a sequence of bytes.

The Unicode standard includes a description of a number of Unicode encodings, such as UTF-8 and UTF-16BE/UTF-16LE, which allow the entire character space to be encoded. Conversion between these encodings can be freely carried out without loss of information.

Also, no one has canceled single-byte encodings, but they allow you to encode your own individual and very narrow piece of the Unicode spectrum - 256 or less code positions. For such encodings, tables exist and are available to everyone, where each value of a single byte is associated with a Unicode character (see, for example, CP1251.TXT). Despite the limitations, single-byte encodings turn out to be very practical when it comes to working with a large array of monolingual text information.
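What such a table looks like in practice is easy to see from Python; each byte value of a single-byte encoding maps to at most one Unicode character (the byte values below are our own examples):

    for byte in (0x41, 0xC0, 0xE9):
        ch = bytes([byte]).decode("cp1251")
        print(f"0x{byte:02X} -> U+{ord(ch):04X} {ch}")
    # 0x41 -> U+0041 A  (the ASCII half is the same in every extended encoding)
    # 0xC0 -> U+0410 А  (Cyrillic capital А)
    # 0xE9 -> U+0439 й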

Of the Unicode encodings, the most common on the Internet is UTF-8 (it won the palm in 2008), mainly due to its efficiency and transparent compatibility with seven-bit ASCII. Latin and service characters, basic punctuation marks and numbers - i.e. all seven-bit ASCII characters are encoded in UTF-8 as one byte, the same as in ASCII. The characters of many major scripts, not counting some rarer hieroglyphic characters, are represented in it by two or three bytes. The largest code position defined by the standard, 10FFFF, is encoded in four bytes.

Please note that UTF-8 is a variable code length encoding. Each Unicode character in it is represented by a sequence of code quantums with a minimum length of one quantum. The number 8 means the bit length of the code unit (code unit) - 8 bits. For the UTF-16 encoding family, the size of the code quantum is, accordingly, 16 bits. For UTF-32 - 32 bits.

If you are sending an HTML page with Cyrillic text over the network, then UTF-8 can give a very significant benefit, because all markup, as well as JavaScript and CSS blocks, will be effectively encoded in one byte. For example, the main page of Habr in UTF-8 occupies 139Kb, and in UTF-16 it is already 256Kb. For comparison, if you use win-1251 with the loss of the ability to save some characters, then the size, compared to UTF-8, will be reduced by only 11Kb to 128Kb.

To store string information in applications, 16-bit Unicode encodings are often used due to their simplicity, as well as the fact that the characters of the world's major writing systems are encoded in one sixteen-bit quantum. For example, Java successfully uses UTF-16 for internal string representation. The Windows operating system also uses UTF-16 internally.

In any case, as long as we remain in the Unicode space, it doesn't really matter how string information is stored within an individual application. If the internal storage format allows you to correctly encode all the million-plus code positions and there is no loss of information at the application boundary, for example when reading from a file or copying to the clipboard, then everything is fine.

To correctly interpret text read from a disk or from a network socket, you must first determine its encoding. This is done either using meta-information provided by the user, written in or adjacent to the text, or determined heuristically.
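Heuristic detection is usually delegated to a library; below is a hedged sketch using the third-party chardet package (pip install chardet). Note that guesses on short samples can be wrong:

    import chardet

    data = "Привет, мир".encode("windows-1251")
    guess = chardet.detect(data)      # e.g. {'encoding': 'windows-1251', 'confidence': ..., 'language': ...}
    print(guess)
    if guess["encoding"]:
        print(data.decode(guess["encoding"]))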

Bottom line

There is a lot of information and it makes sense to give a brief summary of everything that was written above:
  • Unicode postulates a clear distinction between characters, their representation in a computer, and their display on an output device.
  • Unicode characters do not always correspond to a character in the traditional-naive sense, such as a letter, number, punctuation mark or hieroglyph.
  • The Unicode code space consists of 1,114,112 code positions ranging from 0 to 10FFFF.
  • The basic multilingual plane includes Unicode characters U+0000 to U+FFFF, which are encoded in UTF-16 as two bytes.
  • Any Unicode encoding allows you to encode the entire space of Unicode code positions, and conversion between different such encodings is carried out without loss of information.
  • Single-byte encodings allow encoding only a small part of the Unicode spectrum, but can be useful when working with large amounts of monolingual information.
  • UTF-8 and UTF-16 encodings have variable code lengths. In UTF-8, each Unicode character can be encoded in one, two, three, or four bytes. In UTF-16 - two or four bytes.
  • The internal format for storing text information within a separate application can be arbitrary, provided it works correctly with the entire space of Unicode code positions and there are no losses during cross-border data transfer.

A quick note about coding

There can be some confusion with the term coding. Within Unicode, encoding occurs twice. The first time a Unicode character set is encoded, in the sense that each Unicode character is assigned a corresponding code position. This process turns the Unicode character set into a coded character set. The second time, the sequence of Unicode characters is converted into a byte string and this process is also called encoding.

In English terminology, there are two different verbs, to code and to encode, but even native speakers often get confused with them. In addition, the term character set or charset is used as a synonym for the term coded character set.

We say all this to the fact that it makes sense to pay attention to the context and distinguish between situations when we are talking about the code position of an abstract Unicode character and when we are talking about its byte representation.

Finally

There are so many different aspects of Unicode that it is impossible to cover everything in one article. And it's unnecessary. The above information is enough to avoid confusion in the basic principles and work with text in most everyday tasks (read: without going beyond the BMP). In the following articles we will talk about normalization, give a more complete historical overview of the development of encodings, talk about the problems of Russian-language Unicode terminology, and also make material about the practical aspects of using UTF-8 and UTF-16.





