Python Tesseract OCR training for a specific word list

I am new to OCR and Tesseract.

So far, I have a working script that extracts text from images fairly well.

My question: is it possible to train Tesseract to extract only the words / characters listed in some kind of dictionary file?

For example, I have a .txt file with a large list of people's names, and I want to teach Tesseract that "SONIA" is not "50NlA", that "YANNICK" is not "VANNlD", and so on.

If Tesseract has a list of all possible names, can it give better precision? And if the original image is text with a lot of people's names plus other information about those people, but I only want to get the names from the OCR and ignore the "noisy information", what can I do? Sorry if this is a stupid question.
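A minimal sketch of the "keep only the names, ignore the noise" idea, done as post-processing on the OCR output rather than as Tesseract training. It assumes the names are printed in capitals (as in the examples above), that the name list is stored in upper case, and that look-alike glyphs such as 0/O, 5/S and l/I can be mapped before matching. The `CONFUSABLES` table, the function names and the 0.6 cutoff are illustrative choices, not anything Tesseract provides.

```python
import difflib
import re

# Assumption: names are printed in capitals, so look-alike glyphs can be
# mapped to letters before matching (hypothetical table, extend as needed).
CONFUSABLES = str.maketrans({"0": "O", "5": "S", "1": "I", "l": "I", "|": "I"})


def match_name(token, names, cutoff=0.6):
    """Return the closest known name for one OCR token, or None."""
    normalized = token.translate(CONFUSABLES).upper()
    hits = difflib.get_close_matches(normalized, names, n=1, cutoff=cutoff)
    return hits[0] if hits else None


def extract_names(ocr_text, names):
    """Keep only tokens that resemble a known name; drop everything else."""
    tokens = re.findall(r"[A-Za-z0-9|]+", ocr_text)
    matches = (match_name(token, names) for token in tokens)
    return [name for name in matches if name is not None]


# Hypothetical data: in practice the names come from the .txt file and the
# text from the existing OCR script.
KNOWN_NAMES = ["SONIA", "YANNICK"]
sample = "Name: 50NlA Age 34\nName: VANNlD Age 29"
print(extract_names(sample, KNOWN_NAMES))
```

The cutoff trades recall against false matches: a lower value recovers more badly damaged names but also lets unrelated words through, so it is worth tuning on real output.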

I have read this https://groups.google.com/forum/#!topic/tesseract-ocr/r5qkHxQOT98 and the guide http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html and created the eng.user-words and bazaar files. What should the next step be? So far it gives me the same results.
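For reference, a hedged sketch of another way to hand a word list to the command-line tool: newer Tesseract releases accept a `--user-words PATH` option, which avoids placing the file inside the tessdata directory. The file name `names.user-words` and the image name `scan.png` are placeholders, and the snippet only builds the command rather than running it. Depending on the Tesseract version and recognition engine, a user word list may have little or no visible effect on the output.

```python
import tempfile
from pathlib import Path


def write_user_words(names, path):
    """Write one word per line, the plain-text layout used for user words."""
    path = Path(path)
    path.write_text("\n".join(names) + "\n", encoding="utf-8")
    return path


def build_command(image_path, words_path, lang="eng"):
    """Build (but do not run) a tesseract call that prints text to stdout."""
    return ["tesseract", str(image_path), "stdout",
            "-l", lang, "--user-words", str(words_path)]


# Demo with placeholder names; a temporary directory keeps it side-effect free.
with tempfile.TemporaryDirectory() as tmp:
    words = write_user_words(["SONIA", "YANNICK"], Path(tmp) / "names.user-words")
    content = words.read_text(encoding="utf-8")
    command = build_command("scan.png", words)

print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)` on a machine where the `tesseract` binary is installed.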

Thanks a lot for your time and patience.

python string image-processing ocr tesseract

