Unicode: a character encoding standard


Since a number of computer systems (for example, Windows NT) already used fixed 16-bit characters as their default encoding, it was decided to encode all the most important characters within the first 65,536 positions, the so-called Basic Multilingual Plane (BMP). The rest of the space is used for supplementary characters: the writing systems of extinct languages, very rarely used Chinese characters, and mathematical and musical symbols.

For compatibility with older 16-bit systems, the UTF-16 form was invented: the first 65,536 positions, excluding those in the interval U+D800...U+DFFF, are represented directly as 16-bit numbers, while the rest are represented as "surrogate pairs" (the first element of the pair from the range U+D800...U+DBFF, the second from the range U+DC00...U+DFFF). For surrogate pairs, a part of the code space (2,048 positions) previously reserved for private-use characters was taken.

Since UTF-16 can represent only 2^20 + 2^16 − 2048 (1,112,064) characters, this number was chosen as the final size of the Unicode code space.
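The surrogate arithmetic and the resulting size of the code space can be checked with a small Python sketch (the code point U+1F600 is an arbitrary supplementary-plane example):

# Sketch: computing the UTF-16 surrogate pair for a supplementary character.
cp = 0x1F600                       # arbitrary code point above the BMP
offset = cp - 0x10000              # 20-bit offset
high = 0xD800 + (offset >> 10)     # first element, U+D800...U+DBFF
low = 0xDC00 + (offset & 0x3FF)    # second element, U+DC00...U+DFFF
print(hex(high), hex(low))         # 0xd83d 0xde00

# Python's UTF-16 codec produces the same pair (big-endian shown):
assert chr(cp).encode('utf-16-be') == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])

# 17 planes of 65,536 positions, minus the 2,048 surrogate positions:
assert 17 * 2**16 - 2048 == 1112064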

Although the Unicode code area was extended beyond 2^16 as early as version 2.0, the first characters in the "top" area were not placed until version 3.1.

The role of this encoding in the web sector is constantly growing; at the beginning of 2010, the share of websites using Unicode was about 50%.

Unicode versions

As the Unicode character table is revised and extended, new versions of the system are released; this work is ongoing, since initially the Unicode system included only Plane 0 (two-byte codes). New ISO documents are published as well. The Unicode system exists in the following versions:

  • 1.1 (corresponds to ISO/IEC 10646-1:1993), 1991-1995.
  • 2.0, 2.1 (the same ISO/IEC 10646-1:1993 standard plus additions: Amendments 1 through 7 and Technical Corrigenda 1 and 2), 1996.
  • 3.0 (ISO/IEC 10646-1:2000), 2000.
  • 3.1 (ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001), 2001.
  • 3.2, 2002.
  • 4.0, 2003.
  • 4.0.1, 2004.
  • 4.1, 2005.
  • 5.0, 2006.
  • 5.1, 2008.
  • 5.2, 2009.
  • 6.0, 2010.
  • 6.1, 2012.
  • 6.2, 2012.

Code space

Although the UTF-8 and UTF-32 notations allow up to 2^31 (2,147,483,648) code positions to be encoded, it was decided to use only 1,112,064 for compatibility with UTF-16. Even this is more than enough for now: in version 6.0, just under 110,000 code positions are used (109,242 graphic characters and 273 others).

The code space is divided into 17 planes of 2^16 (65,536) characters each. Plane 0 is called the basic plane (BMP); it contains the characters of the most commonly used scripts. Plane 1 is used mainly for historical scripts, Plane 2 for rarely used CJK ideographs, and Plane 3 is reserved for archaic Chinese characters. Planes 15 and 16 are allocated for private use.

Unicode characters are denoted using the notation "U+xxxx" (for codes 0...FFFF), "U+xxxxx" (for codes 10000...FFFFF), or "U+xxxxxx" (for codes 100000...10FFFF), where the x's are hexadecimal digits. For example, the character "я" (U+044F) has the code 044F (hexadecimal) = 1103 (decimal).
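In Python, for example, the correspondence between a character and its U+ notation can be sketched like this (using the same character "я" as above):

# Sketch: converting between a character and its "U+xxxx" notation.
ch = 'я'                 # CYRILLIC SMALL LETTER YA
cp = ord(ch)             # 1103 (decimal)
print(f'U+{cp:04X}')     # U+044F
assert chr(0x044F) == ch
assert int('044F', 16) == 1103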

Coding system

The Universal Coding System (Unicode) is a set of graphic characters and a method of encoding them for computer processing of text data.

Graphic characters are characters that have a visible representation. They are contrasted with control characters and formatting characters.

Graphic symbols include the following groups:

  • letters contained in at least one of the supported alphabets;
  • numbers;
  • punctuation marks;
  • special signs (mathematical, technical, ideograms, etc.);
  • separators.

Unicode is a system for linear representation of text. Characters that have additional superscript or subscript elements can be represented as a sequence of codes constructed according to certain rules (composite character) or as a single character (monolithic variant, precomposed character).

Modifying characters

Representation of the character "Й" (U+0419) as the base character "И" (U+0418) plus the modifying character " ̆" (U+0306)

Graphic characters in Unicode are divided into spacing and non-spacing (zero-width) characters. Non-spacing characters do not occupy space in the line when displayed; they include, in particular, accent marks and other diacritics. Both kinds have their own codes. Spacing characters are otherwise called base characters, and non-spacing ones combining characters; the latter cannot occur on their own. For example, the character "á" can be represented either as a sequence of the base character "a" (U+0061) and the combining character "́" (U+0301), or as the monolithic (precomposed) character "á" (U+00E1).
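The two representations can be compared with Python's standard unicodedata module (a minimal sketch):

import unicodedata

# Sketch: the same visible character as a base + combining sequence
# and as a precomposed (monolithic) character.
sequence = 'a\u0301'     # 'a' followed by COMBINING ACUTE ACCENT
composed = '\u00E1'      # precomposed 'á'
print(len(sequence), len(composed))        # 2 1
print(unicodedata.combining('\u0301'))     # 230: a combining (non-spacing) mark
assert unicodedata.normalize('NFC', sequence) == composed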

A special type of combining character is the variation selector. Variation selectors affect only those characters for which such variants are defined. In version 5.0, variants are defined for a number of mathematical symbols, for characters of the traditional Mongolian alphabet, and for characters of the Mongolian square script.

Forms of normalization

Since the same characters can be represented by different codes, which sometimes makes processing difficult, there are normalization processes designed to reduce text to a certain standard form.

The Unicode standard defines 4 forms of text normalization:

  • Normalization Form D (NFD): canonical decomposition. In the process of bringing text into this form, all composite characters are recursively replaced by their constituent sequences, in accordance with the decomposition tables.
  • Normalization Form C (NFC): canonical decomposition followed by canonical composition. First, the text is reduced to Form D, after which canonical composition is performed: the text is processed from beginning to end, following these rules:
    • A character S is a starter if it has combining class zero in the Unicode character database.
    • In any sequence of characters starting with a starter S, a character C is blocked from S if and only if between S and C there is some character B that is either a starter or has a combining class equal to or greater than that of C. This rule applies only to strings that have already undergone canonical decomposition.
    • A primary composite is a character that has a canonical decomposition in the Unicode character database (or a canonical decomposition for Hangul) and is not included in the composition exclusion list.
    • A character X can be primary-combined with a character Y if and only if there is a primary composite Z canonically equivalent to the sequence <X, Y>.
    • If the next character C is not blocked from the last encountered starter L and can be successfully primary-combined with it, then L is replaced by the composite L+C, and C is removed.
  • Normalization Form KD (NFKD): compatibility decomposition. When text is cast to this form, all composite characters are replaced using both the canonical and the compatibility decomposition maps, and the result is then put into canonical order.
  • Normalization Form KC (NFKC): compatibility decomposition followed by canonical composition.

The terms “composition” and “decomposition” mean, respectively, the connection or decomposition of symbols into their component parts.

Examples

Original text | NFD | NFC | NFKD | NFKC
Français | Franc\u0327ais | Fran\xe7ais | Franc\u0327ais | Fran\xe7ais
А, Ё, Й | \u0410, \u0415\u0308, \u0418\u0306 | \u0410, \u0401, \u0419 | \u0410, \u0415\u0308, \u0418\u0306 | \u0410, \u0401, \u0419
が | \u304b\u3099 | \u304c | \u304b\u3099 | \u304c
Henry IV | Henry IV | Henry IV | Henry IV | Henry IV
Henry Ⅳ | Henry \u2163 | Henry \u2163 | Henry IV | Henry IV
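These rows can be reproduced with Python's unicodedata module (a quick sketch; the printed hex codes match the table):

import unicodedata

for text in ('Français', 'А Ё Й', 'Henry Ⅳ'):
    for form in ('NFD', 'NFC', 'NFKD', 'NFKC'):
        result = unicodedata.normalize(form, text)
        print(form, result, [f'U+{ord(c):04X}' for c in result])

# Note: the Roman numeral Ⅳ (U+2163) survives the canonical forms
# (NFD/NFC) but becomes the two letters 'IV' under the compatibility
# forms (NFKD/NFKC).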

Bidirectional writing

The Unicode standard supports both languages written left to right (left-to-right, LTR) and languages written right to left (right-to-left, RTL), such as Arabic and Hebrew. In both cases, characters are stored in "natural" (logical) order; their display in the proper direction is handled by the application.

In addition, Unicode supports combined texts that mix fragments with different writing directions. This feature is called bidirectionality (bidirectional text, BiDi). Some lightweight text renderers (such as those in cell phones) may support Unicode but not bidirectionality. All Unicode characters are divided into several categories: those written left to right, those written right to left, and those written in either direction. Characters of the last category (mostly punctuation marks) take the direction of the surrounding text when displayed.
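The directional category of each character can be queried through Python's unicodedata module (a small sketch with arbitrary sample characters):

import unicodedata

# Sketch: every Unicode character carries a bidirectional category.
for ch in ('A', '\u05D0', '3', '!'):    # Latin letter, Hebrew alef, digit, punctuation
    print(repr(ch), unicodedata.bidirectional(ch))
# 'A' L   (strongly left-to-right)
# 'א' R   (strongly right-to-left)
# '3' EN  (European number)
# '!' ON  (other neutral: takes its direction from the surrounding text)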

Supported scripts

Unicode includes virtually all modern scripts.

Many historical scripts have been added for academic purposes, including runes, Ancient Greek, Egyptian hieroglyphs, cuneiform, the Maya script, and the Etruscan alphabet.

Unicode provides a wide range of mathematical and musical symbols and pictograms.

However, Unicode generally does not include company and product logos, although they do appear in fonts (for example, the Apple logo in MacRoman (0xF0) or the Windows logo in Wingdings (0xFF)). In Unicode fonts, logos should be placed only in the Private Use Area.

ISO/IEC 10646

The Unicode Consortium works closely with the ISO/IEC/JTC1/SC2/WG2 working group that is developing International Standard 10646 (ISO/IEC 10646). There is synchronization between the Unicode standard and ISO/IEC 10646, although each standard uses its own terminology and documentation system.

Collaboration between the Unicode Consortium and the International Organization for Standardization (ISO) began in 1991. In 1993, ISO released the DIS 10646.1 standard. To synchronize with it, the Consortium approved version 1.1 of the Unicode standard, which included additional characters from DIS 10646.1. As a result, the assignments of the encoded characters in Unicode 1.1 and DIS 10646.1 coincided exactly.

Subsequently, cooperation between the two organizations continued. In 2000, the Unicode 3.0 standard was synchronized with ISO/IEC 10646-1:2000. The upcoming third version of ISO/IEC 10646 will be synchronized with Unicode 4.0. Perhaps these specifications will even be published as a single standard.

Similar to the UTF-16 and UTF-32 formats in the Unicode standard, ISO/IEC 10646 also has two basic forms of character encoding: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS stands for universal multiple-octet coded character set. UCS-2 can be considered a subset of UTF-16 (UTF-16 without surrogate pairs), while UCS-4 is a synonym for UTF-32.

Presentation methods

Unicode has several forms of representation (Unicode transformation format, UTF): UTF-8, UTF-16 (UTF-16BE, UTF-16LE), and UTF-32 (UTF-32BE, UTF-32LE). A UTF-7 representation form for transmission over seven-bit channels was also developed, but due to incompatibility with ASCII it was not widely used and is not included in the standard. On April 1, 2005, two humorous representation forms were proposed: UTF-9 and UTF-18 (RFC 4042).

Unicode to UTF-8:

0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Theoretically possible, but also not included in the standard:

0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Although UTF-8 allows the same character to be encoded in several ways, only the shortest one is correct; other forms must be rejected for security reasons.
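Both points can be illustrated in Python: characters encode to their shortest form, and a conforming decoder rejects an overlong sequence (0xC0 0xAF is a classic overlong encoding of "/"):

# Sketch: UTF-8 byte lengths and rejection of a non-shortest form.
print('$'.encode('utf-8'))    # b'$'             (1 byte,  U+0024)
print('я'.encode('utf-8'))    # b'\xd1\x8f'      (2 bytes, U+044F)
print('€'.encode('utf-8'))    # b'\xe2\x82\xac'  (3 bytes, U+20AC)

try:
    b'\xc0\xaf'.decode('utf-8')       # overlong encoding of '/'
except UnicodeDecodeError as e:
    print('rejected:', e.reason)      # a conforming decoder must refuse it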

Byte order

In a UTF-16 data stream, the high byte can be written either before the low byte (UTF-16 big-endian) or after it (UTF-16 little-endian). Similarly, there are two variants of the four-byte encoding: UTF-32BE and UTF-32LE.

To indicate the Unicode representation format, a signature is written at the beginning of the text file: the character U+FEFF (zero-width no-break space), also called the byte order mark (BOM). This makes it possible to distinguish UTF-16LE from UTF-16BE, since the character U+FFFE does not exist. The method is also sometimes used to mark the UTF-8 format, although the concept of byte order does not apply to it. Files following this convention begin with the following byte sequences:

UTF-8:    EF BB BF
UTF-16BE: FE FF
UTF-16LE: FF FE
UTF-32BE: 00 00 FE FF
UTF-32LE: FF FE 00 00

Unfortunately, this method does not reliably distinguish between UTF-16LE and UTF-32LE, since the U+0000 character is allowed by Unicode (although real text rarely begins with it).

UTF-16- and UTF-32-encoded files that do not contain a BOM must use big-endian byte order (unicode.org).
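A minimal BOM sniffer can be sketched in Python (the function name and the fallback are illustrative, not part of any standard API; as noted above, the method is not fully reliable):

# Sketch: guessing the representation form from the BOM.
BOMS = [
    (b'\x00\x00\xfe\xff', 'utf-32-be'),   # longest signatures first,
    (b'\xff\xfe\x00\x00', 'utf-32-le'),   # so UTF-32LE wins over UTF-16LE
    (b'\xef\xbb\xbf',     'utf-8'),
    (b'\xfe\xff',         'utf-16-be'),
    (b'\xff\xfe',         'utf-16-le'),
]

def sniff_encoding(data: bytes, default: str = 'utf-8') -> str:
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default   # no BOM: big-endian per the standard, or some agreed default

print(sniff_encoding('test'.encode('utf-16')))   # utf-16-le on little-endian platforms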

Unicode and traditional encodings

The introduction of Unicode led to a change in approach to traditional 8-bit encodings. If earlier the encoding was specified by the font, now it is specified by a table of correspondence between the given encoding and Unicode. In fact, 8-bit encodings have become a form of representation for a subset of Unicode. This has made it much easier to create programs that need to work with many different encodings: now, to add support for another encoding, you just need to add another Unicode conversion table.

In addition, many data formats allow the insertion of arbitrary Unicode characters even if the document is written in an old 8-bit encoding. For example, in HTML you can use numeric character references introduced by an ampersand (such as &#1103; for "я").
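In Python, for example, the correspondence table of an 8-bit encoding such as cp1251 is simply a codec, and conversion between any two encodings goes through Unicode (a sketch):

# Sketch: an 8-bit encoding as a mapping into Unicode.
for byte in (0xC0, 0xC1, 0xFF):
    ch = bytes([byte]).decode('cp1251')
    print(hex(byte), ch, f'U+{ord(ch):04X}')   # 0xc0 А U+0410, 0xc1 Б U+0411, 0xff я U+044F

# Round-tripping through Unicode makes cross-encoding conversion trivial:
assert 'я'.encode('cp1251') == b'\xff'
assert 'я'.encode('koi8-r') == b'\xd1'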

Implementations

Most modern operating systems provide some degree of Unicode support.

One of the first successful commercial implementations of Unicode was the Java programming environment. It fundamentally abandoned the 8-bit representation of characters in favor of 16-bit. Most programming languages ​​now support Unicode strings, although their representation may vary depending on the implementation.

Input methods

The GNU/Linux console also allows entering a Unicode character by its code: hold down Alt and type the decimal code of the character on the numeric keypad. Characters can also be entered by hexadecimal code: hold down AltGr and, for the digits A-F, use the numeric-keypad keys from NumLock to Enter (clockwise). Input in accordance with ISO 14755 is supported as well. For these methods to work, Unicode mode must be enabled in the console by calling unicode_start(1), and a suitable font selected with setfont(8).

The spelling "Юникод" has become firmly established in Russian-language texts. According to Yandex, it is used about 11 times more often than "Unicode". Wikipedia uses the more common variant.

The Consortium website has a special page discussing how to render the word "Unicode" in various languages and writing systems. For Russian Cyrillic, the variant "Юникод" is indicated.

The forms adopted by foreign organizations for rendering the word "Unicode" in Russian are advisory.

See also

  • Project: Introducing symbols of the alphabets of the peoples of Russia into Unicode

Notes

  1. Unicode Transcriptions (English). Archived from the original on August 22, 2011. Retrieved May 10, 2010.
  2. Unicode in the Paratype dictionary
  3. The Unicode® Standard: A Technical Introduction. Archived
  4. History of Unicode Release and Publication Dates. Archived from the original on August 22, 2011. Retrieved July 4, 2010.
  5. The Unicode Consortium. Archived from the original on August 22, 2011. Retrieved July 4, 2010.
  6. Foreword. Archived from the original on August 22, 2011. Retrieved July 4, 2010.
  7. General Structure. Archived from the original on August 22, 2011. Retrieved July 5, 2010.
  8. European Alphabetic Scripts. Archived from the original on August 22, 2011. Retrieved July 4, 2010.
  9. Unicode 88. Archived from the original on August 22, 2011. Retrieved July 8, 2010.
  10. Unicode and Microsoft Windows NT (English). Microsoft Support. Archived
  11. Unicode is used on almost 50% of websites (Russian). Archived from the original on August 22, 2011.
  12. Roadmap to the TIP (Tertiary Ideographic Plane)
  13. http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt (English)
  14. Unicode case is not easy
  15. Most PC fonts implement "uppercase" (majuscule) monospace digits.
  16. In some cases, a document (not plain text) in Unicode can take significantly less space than a document in a single-byte encoding. For example, if a web page contains roughly equal amounts of Russian and Greek text, then in a single-byte encoding either the Russian or the Greek letters would have to be written as ampersand codes using the capabilities of the document format, at 6-7 bytes per character (with decimal codes), i.e. 3.5-4 bytes per letter on average, whereas UTF-8 takes only 2 bytes per Greek or Russian letter.
  17. One of the Arial Unicode font files is 24 megabytes in size; there is a Times New Roman of 120 megabytes that contains close to 65,536 characters.
  18. Even for the most modern and expensive mobile phone it is difficult to allocate 120 MB of memory for a full Unicode font. In practice, the use of full fonts is rarely required.
  19. 350 thousand pages for "Юникод" versus 31 thousand pages for "Unicode".

Links

  • Official site of the Unicode Consortium (English)
  • Unicode in the Open Directory Project link directory (dmoz). (English)
  • What is Unicode? (Russian)
  • Latest version of the Unicode standard
  • Table of Unicode characters with names and descriptions (Russian) (English)
  • Relationship between Unicode and ISO/IEC 10646 (PDF file)
  • FAQ on UTF-8 and Unicode (English)
  • Cyrillic in Unicode

When configuring various Internet functions, any user has probably come across the term "Unicode". To find out what this concept means, read this article to the end.

Unicode: Definition

Today the term "Unicode" refers to a character encoding standard. The standard was proposed in 1991 by the non-profit organization Unicode Inc. It was developed to combine a large number of different characters in a single document. A page created with this encoding can contain ideographs, letters, and mathematical symbols, and all of these characters are displayed without problems.

"Unicode": reasons for creation

Long before the advent of Unicode, the encoding of a document was chosen based on the preferences of its author, so reading a single document often required using several conversion tables, sometimes repeatedly. This made life significantly harder for ordinary users. As mentioned earlier, in 1991 the non-profit organization Unicode Inc. proposed a new type of character encoding to solve this problem. It was created to unify the wide variety of existing standards. Unicode made it possible to achieve what had seemed impossible: a single tool supporting a huge variety of characters. The result exceeded expectations: documents could now simultaneously contain Russian and English text, mathematical expressions, and Latin. Before creating the unified encoding system, the developers had to solve a number of problems arising from the huge number of standards that already existed: limited character sets, garbled "elvish" text (mojibake), duplicate fonts, and the difficulty of converting between encodings.

"Unicode": an excursion into history

Imagine the following picture: it is the 1980s, computer technology is not yet widespread, and each operating system is unique in its own way, modified by enthusiasts for specific needs. The need to exchange information led to further modifications: when opening a document created on another operating system, the screen usually showed strange strings of characters, and fixing the encoding could not always be done quickly; sometimes processing the required document took several months. Users who exchanged information frequently began to build conversion tables for themselves, and these revealed an interesting feature: such tables must be created in two directions at once, since the machine cannot simply invert the computation (the source goes in one column and the result in the other, and they cannot be swapped). If special characters were needed in a document, they first had to be added, and then the recipient had to be told what to do with them so that they did not turn into mojibake. It is also worth noting that separate fonts had to be developed for each encoding, which led to a huge number of duplicates in the operating system: on one page, a user could see a dozen fonts identical to the standard Times New Roman but marked UCS-2, UTF-16, UTF-8, ANSI. All of this created the need for a universal standard.

Unicode: creators

The history of Unicode can be dated to 1987, when Joe Becker of Xerox, together with Mark Davis and Lee Collins of Apple, began research into a practical universal encoding. In 1988, Joe Becker published a proposal for an international multilingual encoding. A few months later the development team was expanded to include specialists such as Glenn Wright of Sun Microsystems and Mike Kernaghan and Ken Whistler of RLG, which made it possible to complete the preliminary work on a unified encoding standard.

Unicode: general description

The Unicode encoding is based on the general concept of a character: an abstract phenomenon that exists as a unit of writing and is realized through graphemes. In Unicode, each character is associated with a unique code belonging to one block of the standard. For example, the grapheme "В" is present in both the English and Russian alphabets, but it corresponds to two different characters (Latin "B", U+0042, and Cyrillic "В", U+0412). These characters also have lowercase counterparts; each character is described by a code, a set of properties, and a name.

Unicode: advantages

Unicode differs from other modern encoding systems in its huge supply of code positions for encoding various characters. Earlier encodings had only 8 bits, meaning they supported just 2^8 = 256 characters. The new development offered 2^16 = 65,536 positions, a big step forward that made it possible to encode almost all existing alphabets. With the advent of Unicode, the need for conversion tables disappeared: a single standard reduced their usefulness to zero. Mojibake disappeared as well, since the new standard made it impossible, and the need to create duplicate fonts was eliminated.

"Unicode": development

Although progress does not stand still, the Unicode encoding continues to hold the leading position in the world, largely because it became easy to implement and widespread. One should not assume, however, that the same Unicode encoding is used today as 25 years ago. Today version 5.x.x is used, and the number of encodable positions has grown to 2^31. From its inception up to version 2.0.0, the Unicode encoding almost doubled the number of characters it included, and this growth continued in subsequent years. By version 4.0.0 it became necessary to enlarge the standard itself, as a result of which Unicode took the form in which we know it today.

What else is useful in Unicode? Besides the huge, constantly growing number of characters, it has one rather useful feature: normalization. Instead of wasting computer resources on repeatedly checking the same character, which may have a similar form in different alphabets, a special algorithm brings similar characters to a standard form, so they can be looked up directly rather than re-examined every time. Four such algorithms (normalization forms) have been developed and implemented, each performing its transformation according to its own principle.

A number of digits and letters look very similar and are hard to tell apart at small font sizes: for example, the digits "0" and "1" and the letters "O" and "l" (lowercase L). This is a serious problem, especially when a strictly unambiguous reading of characters is necessary, for example when writing down an alphanumeric password with a pen on paper or printing it on a printer. The first programmers and font designers had to solve this problem at the very beginning of the computer era, in the 20th century. Special high-contrast fonts have long been available, such as Inconsolata, Consolas (shipped with Windows), Anonymous Pro, DejaVu Sans Mono, and many others. Some of them can be downloaded free of charge from their authors' websites and from specialized Internet resources.
See example:
http://www.levien.com/type/myfonts/inconsolata.html

If technical conditions and design specifications allow, the digit zero can be replaced in HTML code by "Ø" (Latin capital letter O with stroke, from the modification of the Latin alphabet used for the Scandinavian languages Norwegian and Danish), whose shape approximates a zero crossed in half. In a text editor, such a glyph is copied from the special-character table and inserted at the desired position in the line. This trick is useful if finding and installing a special font on your device is difficult; it saves time and prevents confusing the digit "0" (zero) with the letter "O", not only on your own monitor but also on the screens of other devices where the required font may not be available. This form of notation is traditionally used when writing mixed alphanumeric information, such as a password or access code, on a sheet of paper. The appearance of zero in different fonts can be viewed and compared using a specialized service on this page:
http://www.fileformat.info/info/unicode/char/0030/fontsupport.htm


When editing text by hand, an incorrectly written or unnecessary character is crossed out with a large oblique cross (two crisscrossing diagonal strokes of equal length). In a text editor, the same is done through formatting: select the fragment, then choose the strikethrough effect from the menu (Format - Character - Font Effects - Strikethrough). Crossing out one or more words in a line, or an entire paragraph, is done with a horizontal single or double line of sufficient thickness.

If you need to determine whether a given glyph in a text is a letter or a digit, you can type the candidate character into the page's search function and check whether it is found at that exact position.

The standard was proposed in 1991 by the non-profit organization Unicode Consortium (Unicode Inc.). The use of this standard makes it possible to encode a very large number of characters from different scripts: Unicode documents can contain Chinese characters, mathematical symbols, letters of the Greek alphabet, and the Latin and Cyrillic alphabets, and switching between code pages becomes unnecessary.

The standard consists of two main sections: the universal character set (UCS) and the encoding family (UTF, Unicode transformation format). The universal character set specifies a one-to-one correspondence between characters and codes - elements of the code space representing non-negative integers. An encoding family defines the machine representation of a sequence of UCS codes.

The Unicode standard was developed to create a single character encoding for all modern and many ancient written languages. In this standard, each character was originally encoded with 16 bits, which covers an incomparably larger number of characters than the previously common 8-bit encodings. Another important difference between Unicode and other encoding systems is that it not only assigns a unique code to each character, but also defines various characteristics of that character, for example:

  • character type (uppercase letter, lowercase letter, digit, punctuation mark, etc.);
  • character attributes (displayed left to right or right to left, space, line break, etc.);
  • the corresponding uppercase or lowercase letter (for lowercase and uppercase letters, respectively);
  • the corresponding numeric value (for numeric characters).

The entire range of codes from 0 to FFFF is divided into several standard subsets, each corresponding either to the alphabet of some language or to a group of special characters with similar functions. The diagram below gives the general list of the Unicode 3.0 subsets (Figure 2).

Figure 2

The Unicode standard is the basis for storing text in many modern computer systems. However, it is not compatible with most Internet protocols, since its codes may contain any byte values, while protocols typically reserve bytes 00-1F and FE-FF as service bytes. To achieve compatibility, several Unicode transformation formats (UTF) have been developed, of which UTF-8 is by far the most common today. This format defines the following rules for converting each Unicode code into a sequence of bytes (one to three) suitable for transport by Internet protocols:

Unicode code (binary)     UTF-8 bytes (binary)
00000000 0xxxxxxx         0xxxxxxx
00000yyy yyxxxxxx         110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx         1110zzzz 10yyyyyy 10xxxxxx


Here x, y, z denote the bits of the source code, extracted starting from the least significant bit and placed into the result bytes from right to left until all the indicated positions are filled.
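Following these rules, a hand-rolled encoder for codes up to FFFF can be sketched in Python (for illustration only; real programs should use a library codec):

def utf8_encode_bmp(cp: int) -> bytes:
    """Encode a code point <= 0xFFFF according to the table above (sketch)."""
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110zzzz 10yyyyyy 10xxxxxx
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

assert utf8_encode_bmp(0x044F) == 'я'.encode('utf-8')
assert utf8_encode_bmp(0x20AC) == '€'.encode('utf-8')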

Further development of the Unicode standard involved adding new language planes, i.e. characters in the ranges 10000 - 1FFFF, 20000 - 2FFFF, etc., intended among other things for the scripts of dead languages not included in the table above. A new format, UTF-16, was developed to encode these additional characters.

So there are four main ways to encode Unicode as bytes:

UTF-8: 128 characters are encoded in one byte (matching ASCII), 1,920 characters in two bytes (Roman, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters), and 63,488 characters in three bytes (Chinese, Japanese, etc.). The remaining 2,147,418,112 characters (not yet used) can be encoded with four, five, or six bytes.

UCS-2: Each character is represented by 2 bytes. This encoding includes only the first 65,535 characters from the Unicode format.

UTF-16: An extension of UCS-2, it contains 1,114,112 Unicode format characters. The first 65,535 characters are represented by 2 bytes, the rest by 4 bytes.

UCS-4: Each character is encoded in 4 bytes.
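The size differences between these forms are easy to check in Python (a sketch; UCS-2 behaves like UTF-16 without the four-byte case, and UCS-4 like UTF-32):

# Sketch: byte counts for the same characters in different encoding forms.
for ch in ('A', 'я', '€', '\U0001F600'):
    print(f'U+{ord(ch):04X}',
          len(ch.encode('utf-8')),      # 1 / 2 / 3 / 4 bytes
          len(ch.encode('utf-16-le')),  # 2 bytes in the BMP, 4 beyond it
          len(ch.encode('utf-32-le')))  # always 4 bytes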

Believe it or not, there is an image format built into the browser. This format lets you load images before they are needed, renders them on regular or retina screens, and lets you style them with CSS. OK, that's not entirely true: it's not an image format, though everything else still applies. Using it, you can create icons that are resolution-independent, require no loading time, and can be styled with CSS.

What is Unicode?

Unicode gives you the ability to correctly display letters and punctuation marks from different languages on one page. It is incredibly useful: users around the world will be able to interact with your site, and it will show what you want, whether that is French with accent marks or kanji.

Unicode continues to develop: version 8.0 is now current, containing more than 120,000 characters (the original article, published in early 2014, referred to version 6.3 and 110,000 characters).

In addition to letters and numbers, Unicode also has other symbols and icons. In the latest versions, these included emoji, which you can see in the iOS messenger.

HTML pages are created from a sequence of Unicode characters and are converted into bytes when sent over the network. Each letter and each symbol of any language has its own unique code and is encoded when the file is saved.

When using the UTF-8 encoding, you can insert Unicode characters directly into text, but you can also add them by specifying a numeric character reference. For example, &#9829; is a heart symbol (♥), and you can output this symbol simply by adding that code to the markup.

The numeric reference can be specified in either decimal or hexadecimal format. The hexadecimal format requires an x after the hash sign: &#x2665; produces the same heart (♥) as the previous variant (2665 is the hexadecimal form of 9829).

If you add a Unicode character using CSS, then you can only use hexadecimal values.

Some of the most commonly used Unicode characters have more memorable text names or abbreviations instead of numeric codes, for example the ampersand (&amp; for &). Such references are called mnemonics (named character references) in HTML; a complete list is on Wikipedia.
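A quick Python sketch shows that decimal, hexadecimal, and mnemonic references all resolve to the same characters (html.unescape is in the standard library):

import html

print(html.unescape('&#9829; &#x2665; &hearts;'))   # ♥ ♥ ♥
print(html.unescape('&amp;'))                       # &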

Why should you use Unicode?

Good question, here are some reasons:

  1. To use correct characters from different languages.
  2. To replace icons.
  3. To replace icons connected via @font-face.
  4. To set CSS classes.

Valid characters

The first reason does not require any additional action. If the HTML is saved as UTF-8 and its encoding is transmitted over the network as UTF-8, everything should work as expected.

It should, at least. Unfortunately, not all browsers and devices support all Unicode characters equally (more precisely, not all fonts include the full character set). For example, recently added emoji characters are not yet supported everywhere.

To support UTF-8 in HTML5, add <meta charset="utf-8"> to the page (especially important if you don't have access to the server settings). With the old doctype, the longer form is used: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.

Icons

The second reason for using Unicode is that it offers a large number of useful characters that can be used as icons, for example ≡, ★, and ♥.

Their obvious advantage is that you don’t need any additional files to add them to the page, which means your site will be faster. You can also change their color or add a shadow using CSS. And by adding transitions (css transition), you can smoothly change the color of the icon when you hover over it without any additional images.

Let's say I want to include a star rating indicator on my page. I can do it like this:

★ ★ ★ ☆ ☆

You will get a row of stars. But if you're unlucky, you'll see something like a row of empty boxes instead; that is how the same rating looked on a BlackBerry 9000. This happens when the characters you use aren't in the browser's or device's font (fortunately, these asterisks are well supported, and older BlackBerry phones are the only exception here).

If a Unicode character is missing, it may be replaced by a variety of characters, ranging from an empty square (□) to a diamond with a question mark (�).
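A rating indicator like the one above can also be generated programmatically; here is a tiny Python sketch (the function name is illustrative):

def stars(rating: int, out_of: int = 5) -> str:
    # Build a star rating from filled (U+2605) and empty (U+2606) stars.
    return '★' * rating + '☆' * (out_of - rating)

print(stars(3))   # ★★★☆☆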

How do you find a Unicode character that might suit your design? You can browse the available characters on a site like Unicodinator, but there is a better option: Shape Catcher, a great site that lets you draw the icon you're looking for and then offers a list of similar Unicode characters.

Using Unicode with @font-face icons

If you are using icons connected to an external font via @font-face , Unicode characters can be used as a fallback. This way you can show a similar Unicode character on those devices or browsers where @font-face is not supported:

On the left are the Font Awesome icons in Chrome, and on the right are the Unicode characters that replace them in Opera Mini.

Many @font-face icon-font generators use a range of Unicode characters from the Private Use Area. The problem with this approach is that if @font-face is not supported, the character codes are passed to the user without any meaning.

One icon-set generator (the first link at the end of this article) is great for creating @font-face icon sets and lets you choose a suitable Unicode character as the basis for each icon.

But be careful: some browsers and devices don't handle single Unicode characters well when used with @font-face. It's worth checking Unicode character support with Unify, an app that helps you determine whether a character is safe to use in an @font-face icon set.

Unicode character support

The main problem with using Unicode characters as a fallback is poor support in screen readers (again, some information about this can be found on Unify), so it's important to choose the characters you use carefully.

If your icon is just a decorative element next to a text label that is readable by a screen reader, you don't have to worry too much. But if the icon is placed separately, it's worth adding a hidden text label to help screen reader users. Even if a Unicode character is read by a screen reader, there is a chance that it will be very different from its intended purpose. For example, ≡ (≡) as a hamburger icon will be read as “identical” by VoiceOver on iOS.

Unicode in CSS class names

It has been known since 2007 that Unicode can be used in class names and style sheets: that is when Jonathan Snook wrote about using Unicode characters in helper classes for laying out rounded corners. The idea has not gained much popularity, but it is worth knowing about the possibility of using Unicode (special characters or Cyrillic) in class names.

Font selection

Very few fonts support the full set of Unicode characters, so when choosing a font, make sure it has the characters you need.

Segoe UI Symbol and Arial Unicode MS contain a large number of such icons. These fonts are available on both PC and Mac; Lucida Grande also has a fair number of Unicode characters. You can add these fonts to your font-family declaration to ensure that the maximum number of Unicode characters is available to users who have them installed.

Defining Unicode Support

It would be very convenient to be able to check for the presence of a particular Unicode character, but there is no guaranteed way to do this.

Unicode characters can be effective if supported. For example, an emoji in the subject line of an email makes it stand out from the rest in the inbox.

Conclusion

This article covers only the basics of Unicode. I hope you find it useful and that it helps you understand Unicode better and use it effectively.

List of links

  • (Unicode based @font-face icon set generator)
  • Shape Catcher (Unicode character recognition tool)
  • Unicodinator (Unicode character table)
  • Unify (Checking Unicode character support in browsers)
  • Unitools (Collection of tools for working with Unicode)





