Changing the encoding to UTF-8 in PHP. Determining the encoding of text in PHP: an overview of existing solutions, plus one more reinvented wheel


I was faced with a task: automatically detecting the encoding of a page, a text, or anything else. The problem is not new, and plenty of wheels have already been reinvented. This article gives a short overview of what I found on the net, plus my own proposal, which seems to me a worthy solution.

1. Why not mb_detect_encoding()?

In short, it doesn't work.

Let's take a look:
// Input: Russian text encoded in CP1251. The literal is a Russian sentence
// in the original script; it is shown here in English translation.
$string = iconv("UTF-8", "Windows-1251", "He went up to Anna Pavlovna, kissed her hand, substituting his perfumed and radiant bald head for her, and calmly sat down on the sofa.");

// Let's see what mb_detect_encoding() gives us. First with $strict = FALSE
var_dump(mb_detect_encoding($string, array("UTF-8"))); // UTF-8
var_dump(mb_detect_encoding($string, array("UTF-8", "Windows-1251"))); // Windows-1251
var_dump(mb_detect_encoding($string, array("UTF-8", "KOI8-R"))); // KOI8-R
var_dump(mb_detect_encoding($string, array("UTF-8", "Windows-1251", "KOI8-R"))); // FALSE
var_dump(mb_detect_encoding($string, array("UTF-8", "ISO-8859-5"))); // ISO-8859-5
var_dump(mb_detect_encoding($string, array("UTF-8", "Windows-1251", "KOI8-R", "ISO-8859-5"))); // ISO-8859-5

// Now with $strict = TRUE
var_dump(mb_detect_encoding($string, array("UTF-8"), TRUE)); // FALSE
var_dump(mb_detect_encoding($string, array("UTF-8", "Windows-1251"), TRUE)); // FALSE
var_dump(mb_detect_encoding($string, array("UTF-8", "KOI8-R"), TRUE)); // FALSE
var_dump(mb_detect_encoding($string, array("UTF-8", "Windows-1251", "KOI8-R"), TRUE)); // FALSE
var_dump(mb_detect_encoding($string, array("UTF-8", "ISO-8859-5"), TRUE)); // ISO-8859-5
var_dump(mb_detect_encoding($string, array("UTF-8", "Windows-1251", "KOI8-R", "ISO-8859-5"), TRUE)); // ISO-8859-5
As you can see, the output is a complete mess. What do we do when it is not clear why a function behaves this way? That's right, we google it. I found a great answer.

To finally dispel any hope of using mb_detect_encoding(), we need to dig into the source code of the mbstring extension. So, sleeves rolled up, let's go:
// ext/mbstring/mbstring.c:2629
PHP_FUNCTION(mb_detect_encoding)
{
    ...
    // line 2703
    ret = mbfl_identify_encoding_name(&string, elist, size, strict);
    ...
Ctrl + click:
// ext/mbstring/libmbfl/mbfl/mbfilter.c:643
const char *mbfl_identify_encoding_name(mbfl_string *string, enum mbfl_no_encoding *elist, int elistsz, int strict)
{
    const mbfl_encoding *encoding;
    encoding = mbfl_identify_encoding(string, elist, elistsz, strict);
    ...
Ctrl + click:
// ext/mbstring/libmbfl/mbfl/mbfilter.c:557
/*
 * identify encoding
 */
const mbfl_encoding *mbfl_identify_encoding(mbfl_string *string, enum mbfl_no_encoding *elist, int elistsz, int strict)
{
    ...
I will not post the full text of the function, so as not to clutter the article with source code; anyone interested can look it up. What interests us is line 593, where the actual check is made of whether a character fits the encoding:
// ext/mbstring/libmbfl/mbfl/mbfilter.c:593
(*filter->filter_function)(*p, filter);
if (filter->flag) {
    bad++;
}
Here are the basic filters for single-byte Cyrillic:

Windows-1251 (original comments preserved)
// ext/mbstring/libmbfl/filters/mbfilter_cp1251.c:142
/* all of this is so ugly now! */
static int mbfl_filt_ident_cp1251(int c, mbfl_identify_filter *filter)
{
    if (c >= 0x80 && c < 0xff)
        filter->flag = 0;
    else
        filter->flag = 1; /* not it */
    return c;
}

KOI8-R
// ext/mbstring/libmbfl/filters/mbfilter_koi8r.c:142
static int mbfl_filt_ident_koi8r(int c, mbfl_identify_filter *filter)
{
    if (c >= 0x80 && c < 0xff)
        filter->flag = 0;
    else
        filter->flag = 1; /* not it */
    return c;
}

ISO-8859-5 (this is where the fun begins)
// ext/mbstring/libmbfl/mbfl/mbfl_ident.c:248
int mbfl_filt_ident_true(int c, mbfl_identify_filter *filter)
{
    return c;
}
As you can see, the ISO-8859-5 filter always reports TRUE (to report FALSE, it would have to set filter->flag = 1).

Once we look at the filters, everything falls into place. CP1251 is indistinguishable from KOI8-R. And if ISO-8859-5 is in the list of encodings, it will always be accepted as correct.

All in all, a fail. That is understandable: in the general case, the encoding cannot be determined from character codes alone, since the same codes are used by different encodings.
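The overlap can be checked from PHP itself. The sketch below is only an illustration: it assumes the iconv extension is available, and the byte string is a hand-written CP1251 encoding of the word "Привет". All three single-byte encodings accept the same bytes without complaint, so validity alone cannot single one of them out.

```php
<?php
// "Привет" written out as raw CP1251 bytes, so the example does not
// depend on the encoding of this source file.
$bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

$decoded = array();
foreach (array("Windows-1251", "KOI8-R", "ISO-8859-5") as $encoding) {
    // iconv() returns false only if a byte is illegal in $encoding.
    $decoded[$encoding] = iconv($encoding, "UTF-8", $bytes);
}

foreach ($decoded as $encoding => $text) {
    echo $encoding, ": ", ($text === false ? "invalid" : $text), "\n";
}

// The same bytes are not valid UTF-8. That is the one case that
// byte-level validation can reliably rule out.
$isValidUtf8 = (preg_match("//u", $bytes) === 1);
```

Only the first line of the output is readable Russian; the other two are equally "valid" but meaningless, which is why some knowledge of the language is needed on top of the byte values.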

2. What Google turns up

And Google turns up all sorts of junk. I won't even post the code here; take a look yourself if you want (remove the space after http: //, I don't know how to show the text without making it a link):

http: // deer.org.ua/2009/10/06/1/
http: // php.su/forum/topic.php?forum=1&topic=1346

3. Searching Habr

1) character codes again: habrahabr.ru/blogs/php/27378/#comment_710532

2) a very interesting solution, in my opinion: habrahabr.ru/blogs/php/27378/#comment_1399654
The pros and cons are in the comment at the link. Personally, I think this solution is overkill for encoding detection alone: it is too powerful, and detecting the encoding is merely a side effect of it.

4. My own solution

The idea came to me while looking at the second link from the previous section. It is as follows: take a large Russian text, measure the frequencies of the different letters, and use these frequencies to detect the encoding. Looking ahead, I will say right away that there will be problems with uppercase and lowercase letters. So I am posting examples of letter frequencies (let's call this a "spectrum") both case-sensitive and case-insensitive (in the second case I dropped the separate uppercase entries and gave each uppercase letter the same frequency as its lowercase counterpart). The space and all letters with frequencies below 0.001 are cut out of these "spectra". Here is what I got after processing War and Peace:

Case-sensitive "spectrum" (the keys are Russian letters, shown here in Latin transliteration):
array(
    "o" => 0.095249209893009,
    "e" => 0.06836817536026,
    "a" => 0.067481298384992,
    "u" => 0.055995027400041,
    "n" => 0.052242744063325,
    ....
    "e" => 0.002252892226507,
    "H" => 0.0021318391371162,
    "P" => 0.0018574762967903,
    "f" => 0.0015961610948418,
    "B" => 0.0014044332975731,
    "O" => 0.0013188987793209,
    "A" => 0.0012623590130186,
    "K" => 0.001180...,
    ... => 0.001061932790165,
)

Case-insensitive:
array(
    "O" => 0.095249209893009, "o" => 0.095249209893009,
    "E" => 0.06836817536026, "e" => 0.06836817536026,
    "A" => 0.067481298384992, "a" => 0.067481298384992,
    "I" => 0.055995027400041, "i" => 0.055995027400041,
    ....
    "C" => 0.0029893589260344, "c" => 0.0029893589260344,
    "u" => 0.0024649163501406, "U" => 0.0024649163501406,
    "E" => 0.002252892226507, "e" => 0.002252892226507,
    "F" => 0.0015961610948418, "f" => 0.0015961610948418,
)

Spectra in different encodings (array keys are the codes of the corresponding characters in the corresponding encoding):

Next, we take a text in an unknown encoding. For each byte of the text and for each candidate encoding, we look up the frequency of the character that this byte represents in that encoding and add it to the encoding's "rating". The encoding with the highest rating is most likely the encoding of the text.

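$encodings = array(
    "cp1251" => require "specter_cp1251.php",
    "koi8r" => require "specter_koi8r.php",
    "iso88595" => require "specter_iso88595.php",
);
$enc_rates = array_fill_keys(array_keys($encodings), 0);
for ($i = 0; $i < strlen($str); ++$i) {
    $code = ord($str[$i]); // code of the current byte
    foreach ($encodings as $encoding => $char_specter) {
        // add the frequency this code has in the given encoding, if any
        if (isset($char_specter[$code])) {
            $enc_rates[$encoding] += $char_specter[$code];
        }
    }
}
var_dump($enc_rates);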
Don't try to run this code as is: it will not work without the spectrum files. Think of it as pseudocode; I omitted the details so as not to clutter the article. The $char_specter arrays are exactly the arrays that the pastebin links refer to.
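For readers who want something that runs, here is a minimal self-contained sketch of the same scoring idea. It is not the author's library: the frequency table is a rough, assumed set of values for ten common lowercase letters, the helper names build_spectrum() and rate_encodings() are hypothetical, and the file is assumed to be saved as UTF-8 with the iconv extension available.

```php
<?php
// Rough illustrative frequencies for ten common lowercase Russian letters.
// These are assumed values, not the full "spectrum" from the article.
$letterFrequencies = array(
    "о" => 0.095, "е" => 0.068, "а" => 0.067, "и" => 0.056, "н" => 0.052,
    "т" => 0.050, "с" => 0.045, "р" => 0.040, "в" => 0.040, "л" => 0.040,
);

// Hypothetical helper: turns letter => weight into byte code => weight
// for one single-byte encoding.
function build_spectrum(array $letterFrequencies, $encoding)
{
    $spectrum = array();
    foreach ($letterFrequencies as $letter => $weight) {
        $byte = iconv("UTF-8", $encoding, $letter);
        if ($byte !== false && strlen($byte) === 1) {
            $spectrum[ord($byte)] = $weight;
        }
    }
    return $spectrum;
}

// Hypothetical helper: sums the weights of every byte of $str under each
// candidate encoding and returns the ratings, best first.
function rate_encodings($str, array $spectra)
{
    $rates = array_fill_keys(array_keys($spectra), 0.0);
    $length = strlen($str);
    for ($i = 0; $i < $length; ++$i) {
        $code = ord($str[$i]);
        foreach ($spectra as $encoding => $spectrum) {
            if (isset($spectrum[$code])) {
                $rates[$encoding] += $spectrum[$code];
            }
        }
    }
    arsort($rates);
    return $rates;
}

$spectra = array();
foreach (array("Windows-1251", "KOI8-R", "ISO-8859-5") as $encoding) {
    $spectra[$encoding] = build_spectrum($letterFrequencies, $encoding);
}

// A lowercase sample whose real encoding is KOI8-R.
$sample = iconv("UTF-8", "KOI8-R", "русский текст");
$rates = rate_encodings($sample, $spectra);
reset($rates);
$best = key($rates);
echo $best, "\n"; // KOI8-R
```

With such a small table the margins are cruder than with a full spectrum, but the mechanics are the same: one lookup per byte per candidate encoding.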

Results

The rows of each table are the actual encoding of the text; the columns are the contents of the $enc_rates array. First, with the case-sensitive "spectra".

1) $str = "Russian text";
cp1251 | koi8r | iso88595 |
0.441 | 0.020 | 0.085 | Windows-1251
0.049 | 0.441 | 0.166 | KOI8-R
0.133 | 0.092 | 0.441 | ISO-8859-5

Everything is fine. Even for such a short text, the real encoding has a rating more than 2.5 times higher than any other. On longer texts the ratio will be about the same.

2) $str = "STRING IN CAPS RUSSIAN TEXT";
cp1251 | koi8r | iso88595 |
0.013 | 0.705 | 0.331 | Windows-1251
0.649 | 0.013 | 0.201 | KOI8-R
0.007 | 0.392 | 0.013 | ISO-8859-5

Oops! A complete mess. The reason is that the codes of uppercase letters in CP1251 mostly coincide with the codes of lowercase letters in KOI8-R, and lowercase letters are used far more often than uppercase ones. So a string typed in caps in CP1251 is detected as KOI8-R.
Now let's try the case-insensitive "spectra".

1) $str = "Russian text";
cp1251 | koi8r | iso88595 |
0.477 | 0.342 | 0.085 | Windows-1251
0.315 | 0.477 | 0.207 | KOI8-R
0.216 | 0.321 | 0.477 | ISO-8859-5

2) $str = "STRING IN CAPS RUSSIAN TEXT";
cp1251 | koi8r | iso88595 |
1.074 | 0.705 | 0.465 | Windows-1251
0.649 | 1.074 | 0.201 | KOI8-R
0.331 | 0.392 | 1.074 | ISO-8859-5

As you can see, the correct encoding consistently leads both with the case-sensitive "spectra" (as long as the string contains few capital letters) and with the case-insensitive ones. With the case-insensitive spectra the lead is less decisive, of course, but it is quite stable even on short strings. You can also experiment with the letter weights, for example by making them nonlinear in the frequency.
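The reason case matters so much can be checked directly. The short sketch below is an illustration only; it assumes the iconv extension and a source file saved as UTF-8. It encodes an uppercase word in CP1251 and then reads the same bytes as KOI8-R.

```php
<?php
// "ТЕКСТ" (uppercase) encoded in CP1251.
$upperCp1251 = iconv("UTF-8", "Windows-1251", "ТЕКСТ");

// Reading the very same bytes as KOI8-R yields lowercase letters only,
// which a case-sensitive spectrum rates much higher than uppercase ones.
$misread = iconv("KOI8-R", "UTF-8", $upperCp1251);
echo $misread, "\n"; // рейяр
```

The misread text is nonsense, but it consists entirely of frequent lowercase letters, so a case-sensitive rating favors the wrong encoding.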

5. Conclusion

This article does not cover UTF-8. There is no fundamental difference there, except that getting character codes and splitting a string into characters takes a little more work.
These ideas are not limited to Cyrillic encodings, of course; it is only a matter of having "spectra" for the relevant languages and encodings.

P.S. If there is real need or interest, I will publish a second part with a fully working library on GitHub. That said, I believe the data in this post is enough to quickly write such a library tailored to your own needs: the "spectrum" for Russian is published, and it can easily be converted to all the necessary encodings.

A simple script suddenly stopped working. Its job is to fetch an HTML page (from a browser game) and extract data with regular expressions. For me as a beginner, this caused bewilderment and mild panic: everything worked yesterday! What happened?
I had to dig into how some PHP functions actually work.

The code was pretty primitive:

// The two phrases in the pattern are in Russian in the original script;
// they are shown here in English translation.
$pattern = "#Write a letter(.*)Battle level#is";
$url = "http://www.heroeswm.ru/pl_info.php?id={$id}";
$html = file_get_contents($url);
preg_match($pattern, $html, $matches);
if (isset($matches[1])) {
    echo $matches[1];
} else {
    echo "not found";
}

The script fetches the data and parses it with a simple regular expression.
I should say that this code is the result of a small modification. In the original version, the regular expression searched by HTML tags. But I needed to find the fragment between two phrases in Russian. I added Russian words to the search pattern, and it was this change that proved critical.

And now, in order.
The game site www.heroeswm.ru serves its pages in the win-1251 encoding. My server uses UTF-8, so all my scripts are saved as UTF-8 without a BOM.
The original script, which searched by HTML tags, worked correctly despite the difference in encodings, but once I added Cyrillic characters to the search pattern, it stopped finding anything. In my task it would have been easy to sidestep the problem by choosing another pattern without Russian words, but in most cases that is impossible. So I decided to get to the bottom of it: what the fundamental difference between the encodings is, why it makes regular expressions misbehave, which other functions are affected by the difference, and how to work around it.

To fetch the data I used the file_get_contents() function, which has the following syntax:

string file_get_contents(string $filename), where $filename is the name of the file (or URL) to read.
Returns the data as a string, or bool(false) if the data could not be read.
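Because an empty result and a failure both evaluate as false in a loose comparison, the return value is best checked with ===. A minimal sketch follows; the helper name fetch_or_null() is hypothetical, and a temporary local file stands in for the remote page.

```php
<?php
// Hypothetical helper: returns the contents, or null when reading fails.
function fetch_or_null($filename)
{
    // "@" silences the warning; the return value is checked instead.
    $data = @file_get_contents($filename);
    return ($data === false) ? null : $data;
}

// An empty temporary file stands in for a page with an empty body.
$path = tempnam(sys_get_temp_dir(), "enc");
file_put_contents($path, "");

$empty   = fetch_or_null($path);              // "" : empty, but not a failure
$missing = fetch_or_null($path . ".missing"); // null: the read failed

unlink($path);
```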

The most obvious difference between win-1251 and UTF-8 is the number of characters they can encode. The first (and every encoding like it) is limited to 256 code values, since each character is encoded in a single byte.

The second can represent a truly huge set of characters, including the letters of national alphabets, Arabic script, and hieroglyphs. This is possible because a character may take more than one byte: Latin letters and digits take one byte, Cyrillic letters take two, and other characters take up to four. That is why UTF-8 and similar encodings are called multibyte, as opposed to single-byte encodings such as win-1251.
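The difference in character length is easy to measure with strlen(), which counts bytes. The sketch assumes the source file is saved as UTF-8 and that the iconv extension is available; the sample characters are arbitrary.

```php
<?php
// strlen() counts bytes, so it reveals how many bytes each character
// occupies in a UTF-8 source file.
$lengths = array();
foreach (array("a", "ж", "€", "😀") as $char) {
    $lengths[$char] = strlen($char);
    echo $char, ": ", $lengths[$char], " byte(s) in UTF-8\n";
}

// The same Cyrillic letter takes a single byte in win-1251.
$cp1251Length = strlen(iconv("UTF-8", "Windows-1251", "ж"));
echo "ж: ", $cp1251Length, " byte(s) in win-1251\n";
```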

With such a vast character set, UTF-8 not only lets you use letters of different alphabets on one site, but also gives some guarantee that a Russian-language site will be displayed correctly even where nobody has heard of encodings with Cyrillic support (win-1251, KOI8-R, CP866, ISO 8859-5 and others): in Japan, Korea, Arab countries, and so on. The price of this versatility is slightly larger text in storage and, accordingly, slower processing by PHP's string functions. Incidentally, in most cases those functions will not work correctly on such strings at all. This is exactly the problem I ran into: a regular expression written in a UTF-8 script simply could not find the substring I needed, which contained Cyrillic characters, in a page received from the site in Windows-1251.

It is reasonable to assume that sites using only Cyrillic and Latin characters have no need for UTF-8, and the simple parser in question would have "lived" quite happily in win-1251. But there are situations where you cannot avoid reconciling these encodings with PHP's string functions, for example when the whole project is developed in UTF-8.

What caused the incorrect behavior of string functions?

As already mentioned, the main difference between the encodings is the length of a character in bytes. Problems therefore arise with functions that treat characters as bytes and also return values measured in bytes (for a single-byte encoding this is harmless, since one character equals one byte).

For example, the call

substr("Проверка", 0, 5); // text in UTF-8 encoding

returns "Пр�" instead of the expected "Прове": in UTF-8, Cyrillic characters are encoded in two bytes, so the result ends in garbage, namely only the first byte of the character "о".

Thus, in most cases, working with UTF-8 strings requires special functions (for example, from PHP's mbstring extension), and sometimes both kinds are needed: to send the size of a string in bytes in an HTTP header you keep strlen(), while to count characters you add mb_strlen().

Syntax of the commonly used functions that may need to be replaced with functions from PHP's mbstring extension:

int strlen(string $string) returns the length of the string in bytes, or 0 if the string is empty.

int strpos(string $haystack, mixed $needle) returns the position of the first occurrence of $needle in the string $haystack, or FALSE if it is not found.

stripos is similar to the previous function, except that the search is case-insensitive.

string substr(string $string, int $start [, int $length]) returns the part of the string that starts at the given position; the optional third parameter limits its length.

Their counterparts designed for multibyte encodings are mb_strlen, mb_strpos, mb_stripos and mb_substr.

Of course, there are many more functions for working with text. I have listed only the most popular ones.
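A side-by-side comparison makes the byte and character views concrete. This is a sketch under two assumptions: the source file is saved as UTF-8, and the mbstring extension may or may not be installed, so the mb_* calls are guarded.

```php
<?php
$text = "Проверка"; // 8 characters, 16 bytes in UTF-8

// Byte-oriented functions.
$byteLength = strlen($text);        // 16
$bytePos    = strpos($text, "вер"); // 6
$byteCut    = substr($text, 0, 5);  // "Пр" plus the first byte of "о"

// Character-oriented counterparts (only if mbstring is loaded).
if (extension_loaded("mbstring")) {
    $charLength = mb_strlen($text, "UTF-8");           // 8
    $charPos    = mb_strpos($text, "вер", 0, "UTF-8"); // 3
    $charCut    = mb_substr($text, 0, 5, "UTF-8");     // "Прове"
}
```

Passing the encoding explicitly to each mb_* call avoids depending on the mbstring internal encoding setting.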

The functions for working with regular expressions stand apart; they are designed to find the substring (or substrings) matching the mask given by a regular expression.

int preg_match(string $pattern, string $subject, array &$matches)

int preg_match_all(string $pattern, string $subject, array &$matches)

The text of $subject is searched for matches of the regular expression $pattern. The search result is written to the $matches variable. The function returns the number of matches found, or FALSE if an error occurs.

To match patterns against UTF-8 text, add the /u modifier to the regular expression; for other multibyte encodings, use the mb_ereg* group of functions.
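The effect of the /u modifier can be seen on a single letter. The sketch assumes a UTF-8 source file; the CP1251 byte is written as an escape so that it survives regardless of how the file is saved.

```php
<?php
$letter = "ж"; // two bytes in UTF-8

// Without /u the dot matches one byte, so one dot cannot cover "ж".
$withoutU = preg_match('/^.$/', $letter);

// With /u the dot matches one whole UTF-8 character.
$withU = preg_match('/^.$/u', $letter);

// With /u, a subject that is not valid UTF-8 makes preg_match() fail
// outright instead of returning 0.
$cp1251Letter = "\xE6"; // "ж" in CP1251
$onInvalid = preg_match('/./u', $cp1251Letter);
$errorCode = preg_last_error();

var_dump($withoutU, $withU, $onInvalid, $errorCode === PREG_BAD_UTF8_ERROR);
```

This is why /u helps only after the subject has been converted to UTF-8, and not while it is still in a single-byte encoding.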

What to do?

The first solution that comes to mind, converting the data received in win-1251 to UTF-8, seemed inconvenient. After the conversion, all the usual functions would have to be replaced with their UTF-8-aware counterparts, or the /u modifier would have to be added to every pattern (note that /u makes PCRE treat both the pattern and the subject as UTF-8, so it is of no use while the subject is still in a single-byte encoding). In my example there is only one preg_match() call, but in practice that is rarely the case.

So I turned the task around: I want to keep using the usual preg_match(), and for that I will convert not the input string but the search pattern, using iconv().

Function syntax:
string iconv(string $in_charset, string $out_charset, string $str) converts $str from $in_charset to $out_charset and returns the converted text without affecting the original variable.

The function returns a string in the new encoding but does not change the string passed to it. So

$pattern = "#Write a letter(.*)Battle level#is";
iconv("UTF-8", "WINDOWS-1251", $pattern); // the result is discarded, $pattern is unchanged
preg_match($pattern, $html, $matches);

won't work: $pattern remains in its original UTF-8 encoding. You need to assign the result of iconv() to a variable:

$pattern = "#Write a letter(.*)Battle level#is";
$pattern = iconv("UTF-8", "WINDOWS-1251", $pattern);
preg_match($pattern, $html, $matches);

Now the search works correctly, but the browser shows nothing but garbage characters. Well, here I already knew what to do: convert the result back into the working encoding, UTF-8. And then a second point surfaced that was not obvious to me, although with more experience it would probably have caused no difficulty: why does iconv() convert some variables and not others?

The $matches variable is an array, and I tried to get away with a single call, iconv($matches). Once again I looked at the description of the function's syntax: of course, all its parameters must be strings, not arrays. So every array value that needs converting has to be visited and converted to the desired encoding separately. In my example I did not loop over the array, because I was interested in a single value rather than the whole array, and I passed that value to iconv().
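When the whole array is needed, the conversion can be applied to each element with array_map(). This is a sketch, not part of the original script: the helper name convert_all() is hypothetical, and the sample array imitates what preg_match() might return for a win-1251 page.

```php
<?php
// Hypothetical helper: converts every string in a flat array from one
// encoding to another and leaves non-string values untouched.
function convert_all(array $values, $from, $to)
{
    return array_map(function ($value) use ($from, $to) {
        return is_string($value) ? iconv($from, $to, $value) : $value;
    }, $values);
}

// Imitation of $matches from a win-1251 page: "Привет" and "вет"
// written as raw CP1251 bytes.
$matches = array("\xCF\xF0\xE8\xE2\xE5\xF2", "\xE2\xE5\xF2");

$utf8Matches = convert_all($matches, "WINDOWS-1251", "UTF-8");
echo implode(", ", $utf8Matches), "\n";
```

This handles the flat array produced by preg_match(); the nested arrays returned by preg_match_all() would need one more level of iteration.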

Here is what I ended up with in the final version:

// set the locale and the encoding of the output
setlocale(LC_ALL, "ru_RU.UTF-8");
header("Content-Type: text/html; charset=UTF-8");

$pattern = "#Write a letter(.*)Battle level#is";
$pattern = iconv("UTF-8", "WINDOWS-1251", $pattern);
$url = "http://www.heroeswm.ru/pl_info.php?id=993353";
$html = file_get_contents($url);
preg_match($pattern, $html, $matches);
if (isset($matches[1])) {
    // convert the found fragment back to UTF-8 before output
    echo iconv("WINDOWS-1251", "UTF-8", $matches[1]);
} else {
    echo "not found";
}

The article was written by a good friend of mine. She is engaged in writing and reviewing texts, programming in PHP is more of her hobby. In my blog, she corrects all publications, and this one became her gift to me for the second anniversary of the blog.


Let's go back to the HTML page we created in the previous lessons and set the encoding in which its text will be stored.

I would like to show you two ways to change the text encoding. I use both in practice, and they have proven themselves well.

The most reliable way is the Notepad++ editor. In my experience this method works consistently, even in difficult cases.

Method 1. Using Notepad++

To change the text encoding we need a text editor called Notepad++.

It's free and you can download it from this site:

Open the HTML page in this program and go to the "Encodings" item of the main menu.

Select the encoding you want to convert to and save the file.

That's the whole procedure. The program is very good and unlike other alternatives, it changes the encoding flawlessly.

Method 2. Using the Dreamweaver code editor

If you work in Dreamweaver, you can also specify the encoding of the page text there.

This is done through the main menu: "Modify - Page Properties".

Next, in the "Title/Encoding" category, select the encoding you need. Most often it will be Unicode (UTF-8).

This method works fine when you create a new HTML document, but if you are changing the encoding of an existing file, it is better to use the first method. It works more reliably in that case.

Try this on your own computer.

But setting the text encoding of the HTML page is not enough. For the page to work properly, one more step is needed: telling the browser what encoding the text is written in.
