Developing a document management solution: how we chose an OCR library for our tasks


We needed to improve document flow in our company, first of all, to increase the speed of processing paper documents. To do this, we decided to develop a software solution based on one of the OCR (optical character recognition) libraries.

OCR, or optical character recognition, is the mechanical or electronic conversion of images of printed text into machine-readable text. OCR is a way of digitizing printed text so that it can be electronically stored, edited, displayed, and applied to machine processes such as cognitive computing, machine translation, and data mining.

Additionally, OCR is used as a method for capturing information from paper documents (including financial records, business cards, invoices, and more).

Before implementing the application itself, we thoroughly analyzed the three most popular OCR libraries in order to determine the option best suited to our tasks:

- Google Text Recognition API

- Tesseract

- Anyline

Google Text Recognition API

The Google Text Recognition API detects text in images and video streams and recognizes the text they contain. Once text is detected, the recognizer determines the actual text in each block and breaks it down into words and lines. It detects text in different languages (French, German, English, etc.) in real time.

It is worth noting that, in general, this OCR coped with the task: we were able to recognize text both in real time and from ready-made images of text documents. While analyzing this library, we identified both advantages and disadvantages of its use.

Advantages:

— Real-time text recognition;

— Text recognition from images;

— Small library size;

— High recognition speed.

Disadvantages:

— Large trained-data files (~30 MB).

Tesseract

Tesseract is an open-source OCR library for various operating systems. It is free software released under the Apache License 2.0 and supports many languages.

Tesseract's development has been funded by Google since 2006, and at that time it was considered one of the most accurate and efficient open-source OCR libraries.

Even so, we were not particularly pleased with the results of using Tesseract: the library is very large and does not support real-time text recognition.

Advantages:

— It is open source;

— Because of that, it is fairly easy to train the OCR to recognize the required fonts and improve the quality of the recognized information. After a quick setup and training pass, recognition quality improved noticeably.

Disadvantages:

— Insufficient out-of-the-box recognition accuracy, which can be remedied by training the recognition algorithm;

— Real-time text recognition requires additional preprocessing of the captured image;

— Low recognition accuracy when using the standard data files for fonts, words and characters.

Anyline

Anyline provides a multi-platform SDK that allows developers to easily integrate OCR functionality into applications. This OCR library attracted us with its numerous options for customizing recognition parameters and the models it provides for solving specific application problems. It is worth noting that the library is paid and intended for commercial use.

Advantages:

— Fairly simple setup for recognizing the required fonts;

— Real-time text recognition;

— Easy, convenient configuration of recognition parameters;

— Recognition of barcodes and QR codes;

— Ready-made modules for solving various problems.

Disadvantages:

— Low recognition speed;

— Initial font setup is required to obtain satisfactory results.

Based on this analysis, we settled on the Google Text Recognition API, which combines high speed, easy setup and good recognition quality.

The solution we developed scans paper documents, digitizes them automatically and saves them to a single database. Recognition accuracy is about 97%, which is a very good result.

Thanks to the new system, internal document flow (including processing documents, creating them and exchanging them between departments) was accelerated by 15%.

This article discusses the procedure for teaching Russian to the open-source OCR system Tesseract, originally developed by Hewlett-Packard.

[Vlasov Igor (igorvlassov at narod dot ru)]

I was recently looking for an OCR for an English-language project, and along the way became acquainted with the state of affairs for my native language. I did not find any free OCR capable of recognizing Russian, but I came across the tesseract project. It is a former commercial multi-platform OCR developed by Hewlett-Packard. It is now distributed under the Apache 2.0 license, and you can try to teach it a new language, which is what we will do. So, let's read the manual called TrainingTesseract:

The first step should be to determine the complete set of characters to be used and create a text or word processor file with examples.

When preparing a training file, remember the following:

    a minimum number of examples of each character is required: 10 is good, 5 is enough for rare symbols, while frequently occurring symbols need at least 20;

    you can't group all the non-letter characters together; the text needs to be made more realistic. For example (the sample text is Russian in the original), a line like

    in the thickets of the south there lived a citrus, yes, but a fake copy. 0123456789 !@#$%^&(),.{}<>/?

    is terrible. Much better: in the thickets (of the south) there lived(-was) a citrus, yes! But? <фальшивый> $3456.78 copy#90 /tomato" 12.5%

    It would be nice to stretch the text by increasing the line and character spacing;

    training data should fit on one page;

    there is no need to train on several sizes; font size 10 is sufficient. But for text heights of less than 15 pixels you need to train separately, or enlarge the image before recognition.
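The minimum-samples rule above is easy to check mechanically. Below is a minimal shell sketch (my own illustration, not part of the manual) that counts how many times each character occurs in a prospective training text, so under-represented characters stand out:

```shell
# Count occurrences of every character in a training text and list
# them rarest-first, so characters below the 5/10/20 thresholds
# are easy to spot.
char_counts() {
  printf '%s' "$1" |
    fold -w1 |          # one character per line
    grep -v '^ *$' |    # ignore spaces and blank lines
    sort | uniq -c |    # count duplicates
    sort -n             # rarest first
}

char_counts "in the thickets of the south there lived a citrus"
```

Any character appearing fewer than 5 times at the top of the list is a candidate for adding more samples.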

Next, you need to print and scan or use some other method to obtain a picture of the training page. Up to 32 pages can be used. It's best to create pages with a mixture of fonts and styles, including bold and italic.

I'll try to do this only for the Thorndale AMT font, which on my desktop under SUSE10 in OpenOffice looks almost like Times New Roman under Windows. So, I make a separate file with the above text about citrus fontfile.odt, print it, scan it and save it in black and white BMP fontfile.bmp.

The next step is to create a file with the coordinates of the rectangles that contain each character. Fortunately, tesseract can make the file in the required format, although the character set will not match ours. Therefore, you will then need to edit the file manually by entering the correct characters. So, let's do:

tesseract fontfile.bmp fontfile batch.nochop makebox

The result is a file fontfile.txt, which must be renamed to fontfile.box

Here's what I have there:

M 85 132 111 154

Z 114 137 130 154

X 133 137 150 154

{ 170 130 180 162

m 186 137 210 154

r 214 137 228 154

a 233 137 248 154

} 254 130 264 162

M 284 137 306 154

Now the hardest part - you need to edit this file in an editor in which you can replace the wrong characters with the right ones. I use Kate.

Oops, it looks like one character was split into two boxes:

M 85 132 111 154

In this case, you need to merge the bounding rectangles as follows:

    The first number (left) must be the minimum of 2 lines (67)

    The second number (bottom) must be the minimum of 2 lines (132)

    The third number (right) must be the maximum of 2 lines (111)

    The fourth number (top) must be the maximum of 2 lines (154)

so: 67 132 111 154
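The min/max rule above can be automated. A sketch (the second input line below is hypothetical; only the merged result 67 132 111 154 comes from the text):

```shell
# Merge two box-file lines for the same character into one bounding
# rectangle: min of left/bottom, max of right/top.
# Line format assumed here: char left bottom right top
merge_boxes() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { ch = $1; l[NR] = $2; b[NR] = $3; r[NR] = $4; t[NR] = $5 }
    END {
      min_l = (l[1] < l[2]) ? l[1] : l[2]
      min_b = (b[1] < b[2]) ? b[1] : b[2]
      max_r = (r[1] > r[2]) ? r[1] : r[2]
      max_t = (t[1] > t[2]) ? t[1] : t[2]
      print ch, min_l, min_b, max_r, max_t
    }'
}

merge_boxes "M 67 140 90 150" "M 85 132 111 154"
# prints: M 67 132 111 154
```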

Note: The coordinate system used in this file starts at (0,0) and is directed from bottom to top and left to right.

Phew, seems to have corrected that. Now we launch tesseract in training mode:

tesseract fontfile.bmp junk nobatch box.train

and look at stderr (or tesseract.log on Windows). Errors like FATALITY mean that tesseract did not find any instances of a symbol specified in the coordinate file. It gave me this error:

APPLY_BOXES: FATALITY - 0 labeled samples of "%" - target is 2

Boxes read from boxfile: 89

Initially labeled blobs: 87 in 3 rows

Box failures detected: 2

Duped blobs for rebalance: 0

"%" has fewest samples: 0

Total unlabeled words: 1

Final labeled words: 87

Generating training data

TRAINING ... Font name = UnknownFont.

Generated training data for 87 blobs

However, fontfile.tr was generated. Okay, I'll do without the % sign for now. In general, I should have included more samples of every symbol, for example by repeating the phrase at least five times.

In theory, I should repeat this procedure for different fonts and get several different files, and then create prototypes using commands like:

mftraining fontfile_1.tr fontfile_2.tr ...

the result should be 2 files: inttemp (form prototypes) and pffmtable , then

cntraining fontfile_1.tr fontfile_2.tr ...

will create a normproto file (prototypes for character normalization?).

I ran this on my single file and got a result. Now we need to tell tesseract the set of characters it can output, using the unicharset_extractor command to generate a unicharset file:

unicharset_extractor fontfile_1.box fontfile_2.box ...

Let's do it. In the resulting file, you must specify the type of each character using a bit mask: bit 1 (0001) if the character is a letter, bit 2 (0010) if it is lowercase, bit 4 (0100) if it is uppercase, bit 8 (1000) if it is a digit; 0 otherwise.

For example:

b is a lowercase letter: mask 0011, or hexadecimal 3;

; is neither a letter nor a digit: mask 0;

7 is just a digit: mask 1000, or hexadecimal 8;

L is a capital letter: mask 0101, or hexadecimal 5.

All my letters are lowercase, so I change their masks to 3.
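The mask logic above can be written as a tiny helper (my own sketch for illustration, ASCII characters and the C locale assumed):

```shell
# Compute the unicharset property mask for a single ASCII character:
# bit 1 = letter, bit 2 = lowercase, bit 4 = uppercase, bit 8 = digit.
char_mask() {
  case "$1" in
    [a-z]) echo 3 ;;  # letter + lowercase = 0011
    [A-Z]) echo 5 ;;  # letter + uppercase = 0101
    [0-9]) echo 8 ;;  # digit              = 1000
    *)     echo 0 ;;  # punctuation etc.
  esac
}

char_mask b   # 3
char_mask L   # 5
char_mask 7   # 8
char_mask ';' # 0
```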

Now you need to obtain two word lists, one of frequently occurring words and one of the remaining words, and convert them to DAWG format using another utility:

wordlist2dawg frequent_words_list freq-dawg

wordlist2dawg words_list word-dawg

To begin with, I simply filled each with 5 random words.

The third file, user-words, is usually empty.

The last file to create is called DangAmbigs. It describes cases where one group of characters is mistakenly recognized as another, e.g.:

2 1I 1 Ш

The first number is the number of characters in the first group, the second is the number of characters in the second group. This line shows that 1I can sometimes be incorrectly recognized as Ш.

This file can also be empty.

Now let's put everything together. Files

    freq-dawg

    word-dawg

    user-words

    inttemp

    normproto

    pffmtable

    unicharset

    DangAmbigs

we prefix with rus. and place them where the other dictionaries live; for me that is /usr/local/share/tessdata.

That's all!
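This assembly step can be scripted. A sketch, assuming the eight files sit in the current directory and the tessdata path is writable (the install_traineddata name is my own, not from the manual):

```shell
# Copy all eight trained-data files into the tessdata directory,
# prefixing each with the language code "rus.".
install_traineddata() {
  dest="$1"
  for f in freq-dawg word-dawg user-words inttemp normproto \
           pffmtable unicharset DangAmbigs; do
    cp "$f" "$dest/rus.$f"
  done
}

# Usage (path from the article):
# install_traineddata /usr/local/share/tessdata
```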

We are trying to recognize a file. I'll try my one about citrus first:

tesseract image.bmp output -l rus

Here's what I got:

in the south (south) zhshl-(was) tsshtrus, yes! But?

<(1ьалвьшвый>$3456.78 copy #90

kapelvsshnm/pomshdor" 12.5th

Of course, it's not great, but for a first attempt with such a meager sample and vocabulary, I think it's not bad!

Vlasov Igor (igorvlassov at narod dot ru) - Tesseract in Russian

I needed to read the values of some bitmapped numbers that were grabbed from the screen.

I was wondering if I should try OCR? I tried Tesseract.

Below I will tell you how I tried to adapt Tesseract, how I trained it, and what came of it. The project on GitHub contains a cmd script that automates the training process as much as possible, plus the data I used for training. In short, it has everything you need to teach Tesseract something useful right away.

Preparation

Clone the repository or download the zip archive (~6 MB). Install tesseract 3.01 from the official site or, if it is no longer there, from the zip-archive/distros subdirectory.

Go to the samples folder and run montage_all.cmd. This script creates the combined image samples/total.png; you don't have to run it, because the result is already in the project root folder.

Why train?

Perhaps the result will be good even without training? Let's check.
./exp1 - as is> tesseract ../total.png total

Let's put the manually corrected values in the file model_total.txt so we can compare recognition results against it. Asterisks mark incorrect values.

model_total.txt | default recognition
27 | 27
33 | 33
39 | 39
625.05 | 525 05*
9 | 9
163 | 153*
1,740.10 | 1,740 10*
15 | 15
36 | 35*
45 | 45
72 | 72
324 | 324
468 | 455*
93 | 93
453 | 453
1,200.10 | 1,200 10*
80.10 | 50 10*
152.25 | 152 25*
158.25 | 155 25*
176.07 | 175 07*
97.50 | 97 50*
170.62 | 170 52*
54 | 54
102 | 102
162 | 152*
78 | 75*
136.50 | 135 50*
443.62 | 443 52*
633.74 | 533 74*
24 | 24
1,579.73 | 1,579 73*
1,576.73 | 1,575 73*
332.23 | 332 23*
957.69 | 957 59*
954.69 | 954 59*
963.68 | 953 55*
1,441.02 | 1,441 02*
1,635.34 | 1,535 34*
50 | 50
76 | 75*
168 | 155*
21 | 21
48 | 45*
30 | 30
42 | 42
108 | 105*
126 | 125*
144 | 144
114 | 114
462 | 452*
378 | 375*
522 | 522
60 | 50*
240 | 240
246 | 245*
459.69 | 459 59*
456.69 | 455 59*
198 | 195*
61 | 51*
255 | 255

default recognition errors

It is clear that there are many errors. Looking closely, you will notice that the decimal point is not recognized, and the digits 6 and 8 are recognized as 5. Will training help get rid of these errors?

Training

Training tesseract lets you adapt it to recognize text images in the same form in which you will feed similar images to it during recognition.
You pass training images to tesseract, correct its recognition errors, and feed those corrections back. Tesseract then adjusts the coefficients in its algorithms to avoid the errors you found in the future.

To perform the training, run train.cmd:
./exp2 - trained> train.cmd

What does the training process involve? Tesseract processes the training image and forms so-called character boxes: it extracts individual characters from the text by building a list of bounding rectangles, and makes guesses as to which character each rectangle contains.

The results of this work are written to the total.box file, which looks like this:
2 46 946 52 956 0
7 54 946 60 956 0
3 46 930 52 940 0
3 54 930 60 940 0
3 46 914 52 924 0
9 53 914 60 924 0
6 31 898 38 908 0
2 40 898 46 908 0
5 48 898 54 908 0
0 59 898 66 908 0

Here the first column contains the symbol, and columns 2 to 5 contain the coordinates of the box: its left, bottom, right and top edges (the last column is the page number).
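A .box line has the layout char left bottom right top page, so each glyph's width (right − left) and height (top − bottom) can be derived with a one-liner. A sketch using the sample lines above:

```shell
# Print each character with its box width (right - left) and
# height (top - bottom), derived from a tesseract .box file.
box_sizes() {
  awk '{ print $1, $4 - $2, $5 - $3 }' "$1"
}

printf '2 46 946 52 956 0\n7 54 946 60 956 0\n' > /tmp/sample.box
box_sizes /tmp/sample.box
# prints:
# 2 6 10
# 7 6 10
```

Glyphs with unusually small or large boxes are often the mis-segmented ones worth inspecting in the editor.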

Of course, editing this file by hand is difficult and inconvenient, so enthusiasts have created graphical utilities to make the job easier. I used one written in Java.

After running
./exp2 - trained> java -jar jTessBoxEditor-0.6/jTessBoxEditor.jar
open the file ./exp2 - trained/total.png; this automatically opens ./exp2 - trained/total.box, and the rectangles defined in it are superimposed on the training image.

The contents of the total.box file are shown on the left, and the training image is on the right. Above the image is the active line of the total.box file

The boxes are shown in blue, and the box corresponding to the active line is in red.

I corrected all the incorrect 5's to the correct 6's and 8's, added lines defining all the decimal points present in the file, and saved total.box.

After editing is complete, close jTessBoxEditor so the script can continue. All further actions are performed automatically, without user intervention. The script saves the training results under the code ttn.

To use the training results for recognition, run tesseract with the -l ttn switch:
./exp2 - trained/> tesseract ../total.png total-trained -l ttn

It can be seen that all numbers are now recognized correctly, but the decimal point is still not recognized.

model_total.txt | default recognition | after training
27 | 27 | 27
33 | 33 | 33
39 | 39 | 39
625.05 | 525 05* | 625 05*
9 | 9 | 9
163 | 153* | 163
1,740.10 | 1,740 10* | 1,740 10*
15 | 15 | 15
36 | 35* | 36
45 | 45 | 45
72 | 72 | 72
324 | 324 | 324
468 | 455* | 468
93 | 93 | 93
453 | 453 | 453
1,200.10 | 1,200 10* | 1,200 10*
80.10 | 50 10* | 80 10*
152.25 | 152 25* | 152 25*
158.25 | 155 25* | 158 25*
176.07 | 175 07* | 176 07*
97.50 | 97 50* | 97 50*
170.62 | 170 52* | 170 62*
54 | 54 | 54
102 | 102 | 102
162 | 152* | 162
78 | 75* | 78
136.50 | 135 50* | 136 50*
443.62 | 443 52* | 443 62*
633.74 | 533 74* | 633 74*
24 | 24 | 24
1,579.73 | 1,579 73* | 1,579 73*
1,576.73 | 1,575 73* | 1,576 73*
332.23 | 332 23* | 332 23*
957.69 | 957 59* | 957 69*
954.69 | 954 59* | 954 69*
963.68 | 953 55* | 963 68*
1,441.02 | 1,441 02* | 1,441 02*
1,635.34 | 1,535 34* | 1,635 34*
50 | 50 | 50
76 | 75* | 76
168 | 155* | 168
21 | 21 | 21
48 | 45* | 48
30 | 30 | 30
42 | 42 | 42
108 | 105* | 108
126 | 125* | 126
144 | 144 | 144
114 | 114 | 114
462 | 452* | 462
378 | 375* | 378
522 | 522 | 522
60 | 50* | 60
240 | 240 | 240
246 | 245* | 246
459.69 | 459 59* | 459 69*
456.69 | 455 59* | 456 69*
198 | 195* | 198
61 | 51* | 61
255 | 255 | 255

recognition errors with training

Magnifying the image

You can enlarge the image in different ways; I tried two: -scale and -resize.

total-scaled.png (fragment):
convert total.png -scale 208x1920 total-scaled.png

total-resized.png (fragment):
convert total.png -resize 208x1920 total-resized.png

Because the character images have grown along with the image itself, the training data under code ttn is out of date, so from here on I recognized without the -l ttn switch.

You can see that in total-scaled.png tesseract confuses 7 with 2, while in total-resized.png it does not. In both images the decimal point is now detected correctly. Recognition of total-resized.png is almost perfect: there are only three errors, a spurious space between the digits in the numbers 21, 114 and 61.

But this error is not critical, because it can easily be fixed by simply removing the spaces from the strings.
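That post-processing step is trivial; a one-line sketch using tr:

```shell
# Strip the spurious spaces tesseract inserts inside short numbers.
fix_number() {
  printf '%s' "$1" | tr -d ' '
}

fix_number "2 1"    # 21
fix_number "6 1"    # 61
fix_number "1 14"   # 114
```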

recognition errors total-scaled.png

recognition errors total-resized.png

model_total.txt | default | after training | total-scaled.png | total-resized.png
27 | 27 | 27 | 22* | 27
33 | 33 | 33 | 33 | 33
39 | 39 | 39 | 39 | 39
625.05 | 525 05* | 625 05* | 625.05 | 625.05
9 | 9 | 9 | 9 | 9
163 | 153* | 163 | 163 | 163
1,740.10 | 1,740 10* | 1,740 10* | 1,240.10* | 1,740.10
15 | 15 | 15 | 15 | 15
36 | 35* | 36 | 36 | 36
45 | 45 | 45 | 45 | 45
72 | 72 | 72 | 22* | 72
324 | 324 | 324 | 324 | 324
468 | 455* | 468 | 468 | 468
93 | 93 | 93 | 93 | 93
453 | 453 | 453 | 453 | 453
1,200.10 | 1,200 10* | 1,200 10* | 1,200.10 | 1,200.10
80.10 | 50 10* | 80 10* | 80.10 | 80.10
152.25 | 152 25* | 152 25* | 152.25 | 152.25
158.25 | 155 25* | 158 25* | 158.25 | 158.25
176.07 | 175 07* | 176 07* | 126.02* | 176.07
97.50 | 97 50* | 97 50* | 92.50* | 97.50
170.62 | 170 52* | 170 62* | 120.62* | 170.62
54 | 54 | 54 | 54 | 54
102 | 102 | 102 | 102 | 102
162 | 152* | 162 | 162 | 162
78 | 75* | 78 | 28* | 78
136.50 | 135 50* | 136 50* | 136.50 | 136.50
443.62 | 443 52* | 443 62* | 443.62 | 443.62
633.74 | 533 74* | 633 74* | 633.24* | 633.74
24 | 24 | 24 | 24 | 24
1,579.73 | 1,579 73* | 1,579 73* | 1,529.23* | 1,579.73
1,576.73 | 1,575 73* | 1,576 73* | 1,526.23* | 1,576.73
332.23 | 332 23* | 332 23* | 332.23 | 332.23
957.69 | 957 59* | 957 69* | 952.69* | 957.69
954.69 | 954 59* | 954 69* | 954.69 | 954.69
963.68 | 953 55* | 963 68* | 963.68 | 963.68
1,441.02 | 1,441 02* | 1,441 02* | 1,441.02 | 1,441.02
1,635.34 | 1,535 34* | 1,635 34* | 1,635.34 | 1,635.34
50 | 50 | 50 | 50 | 50
76 | 75* | 76 | 26* | 76
168 | 155* | 168 | 168 | 168
21 | 21 | 21 | 2 1* | 2 1*
48 | 45* | 48 | 48 | 48
30 | 30 | 30 | 30 | 30
42 | 42 | 42 | 42 | 42
108 | 105* | 108 | 108 | 108
126 | 125* | 126 | 126 | 126
144 | 144 | 144 | 144 | 144
114 | 114 | 114 | 1 14* | 1 14*
462 | 452* | 462 | 462 | 462
378 | 375* | 378 | 328* | 378
522 | 522 | 522 | 522 | 522
60 | 50* | 60 | 60 | 60
240 | 240 | 240 | 240 | 240
246 | 245* | 246 | 246 | 246
459.69 | 459 59* | 459 69* | 459.69 | 459.69
456.69 | 455 59* | 456 69* | 456.69 | 456.69
198 | 195* | 198 | 198 | 198
61 | 51* | 61 | 6 1* | 6 1*
255 | 255 | 255 | 255 | 255

Digitizing images one at a time

Ok, what if you want to digitize images one by one in real time?

I'll try one at a time.
./exp5 - one by one> for /r %i in (*.png) do tesseract "%i" "%i"
Two- and three-digit numbers are not recognized at all!

625.05
1740.10

Digitization in small batches

What if you need to digitize images in batches of several images (6 or 10 in a batch)? I'll try ten.
./exp6 - ten in line> tesseract teninline.png teninline

Everything is recognized, even without the spurious space in the number 61.

Conclusions

In general, I expected worse results, because small bitmap fonts are an edge case: small size, distinct pixelation, and consistency (different images of the same character are exactly identical). Yet practice showed that enlarged, artificially smoothed digits are recognized better.

Image pre-processing has a greater effect than training: resize with smoothing (convert -resize…).

Recognition of individual "short" two- and three-digit numbers is unsatisfactory; such numbers must be collected into batches.

But overall, tesseract coped with the task almost perfectly, even though it is tailored for other tasks: recognizing inscriptions in photos and videos, and scans of documents.

Tesseract-ocr is a free text recognition library. In order to connect it you need to download the following components:
Leptonica - http://code.google.com/p/leptonica/downloads/detail?name=leptonica-1.68-win32-lib-include-dirs.zip
Latest version of tesseract-ocr (currently 3.02) - https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.02.02-win32-lib-include-dirs.zip&can=2&q=
Russian language training data - https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.rus.tar.gz
You can assemble everything yourself by downloading the source codes, but we won’t do that.

Having created a new project, we add the paths to the lib and header files, and write some simple code:

#include <cstdio>
#include <cstdlib>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    tesseract::TessBaseAPI *myOCR = new tesseract::TessBaseAPI();
    printf("Tesseract-ocr version: %s\n", myOCR->Version());
    printf("Leptonica version: %s\n", getLeptonicaVersion());
    if (myOCR->Init(NULL, "rus")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    Pix *pix = pixRead("test.tif");
    myOCR->SetImage(pix);
    char *outText = myOCR->GetUTF8Text();
    printf("OCR output:\n\n");
    printf("%s", outText);
    delete[] outText;
    myOCR->Clear();
    myOCR->End();
    pixDestroy(&pix);
    return 0;
}

We include lib files:
libtesseract302.lib
liblept168.lib

We compile - the program is successfully created. Let's take the following picture as an example:

We run the program, redirecting output to a file (in the console, the UTF-8 output would come out as garbage):
test > a.txt

File contents below:
Tesseract-ocr version: 3.02
Leptonica version: leptonica-1.68 (Mar 14 2011, 10:47:28 AM)
OCR output:

“Substituting this expression into (63), we see that the o-
The single-sideband signal is modulated
and the modulation depth is a.
7 Envelope XO) of the primary signal directly
Actually, it is impossible to observe on an oscilloscope, so
How is this signal not narrowband, but rather
‘in this case, there is no “visibility” of the envelope, But
with single-sideband modulation, a narrow-band
‘loose signal with the same envelope, and then it
“and manifests itself explicitly and sometimes (as in the description
“in this case) brings confusion into the minds of inexperienced
and new researchers..
6.4. "FORMULA COSTAS"
g y
With the advent of OM in textbooks, magazines `
In articles and monographs the question of
about what gain comes from the transition from amplitude
modulation to single sideband. Much has been said
conflicting opinions. In the early 60s the American
Rican scientist J. Costas wrote that, having looked at
Having searched the extensive journal literature on OM, he
found in each article its own assessment of energy
“Cal gain relative to AM - from two to,
several dozen. As a result, he established
-that the winnings indicated in each article are co-
is approximately (Z-K-Y!) dB, where the M-number is
› authors of this article.
Yo, ‘ 11 Even if this joke is inaccurate, it is still correct -
‘the notes reflect the discord that existed
; in those years. In addition to the fact that different authors produced
D made comparisons under different conditions and in different ways
They determined the energy gain in this way: 1
‘‚ `They made a lot of different mistakes. 4 "
‚`Here are examples of some reasoning. ",
1. With conventional AM, assuming the carrier power






