Python Tesseract OCR training for specific wordlist

I am new to OCR and Tesseract.

So far, I have a working script that extracts text from images fairly accurately.

My question: is it possible to train Tesseract to extract only the words / characters listed in some kind of dictionary file?

For example, I have a .txt file with a large list of people's names, and I want to teach Tesseract that "SONIA" is not "50NlA", that "YANNICK" is not "VANNlD", etc.

If Tesseract has a list of all possible names, can it give better accuracy? If the original image contains a lot of people's names along with other information about those people, but I only want to get the names from the OCR output and ignore the "noisy information", what can I do? Sorry if this is a stupid question.
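To make the goal concrete, here is a minimal sketch of the kind of filtering I mean, done as post-processing on the OCR output rather than inside Tesseract. It uses only the standard-library `difflib` module. The confusion table, the `normalize` and `match_names` helpers, and the 0.6 cutoff are all assumptions made up for illustration, and it assumes the names are printed in capitals as in the examples above.

```python
import difflib

# Hypothetical table of look-alike characters that OCR tends to confuse.
# This is an illustrative guess, not something Tesseract provides.
CONFUSIONS = str.maketrans({"0": "O", "5": "S", "1": "I", "l": "I", "|": "I"})


def normalize(token):
    """Map look-alike characters to letters, then upper-case the token."""
    return token.translate(CONFUSIONS).upper()


def match_names(ocr_text, names, cutoff=0.6):
    """Return the known names that the OCR tokens most plausibly stand for.

    Tokens that are not close to any known name are dropped, which is how
    the "noisy information" gets ignored.
    """
    known = sorted({name.upper() for name in names})
    found = []
    for token in ocr_text.split():
        candidate = normalize(token.strip(".,;:()"))
        if not candidate:
            continue
        # get_close_matches ranks candidates by SequenceMatcher similarity.
        hits = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
        if hits:
            found.append(hits[0])
    return found


if __name__ == "__main__":
    print(match_names("50NlA age 34 VANNlD Paris", ["Sonia", "Yannick"]))
```

This is crude: mapping every lowercase "l" to "I" would damage ordinary mixed-case words, and a low cutoff risks false matches when the name list is large, so I would still prefer a way to make Tesseract itself use the list.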

I have read this thread https://groups.google.com/forum/#!topic/tesseract-ocr/r5qkHxQOT98 and the guide http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html and created the eng.user-words and bazaar files. What should the next step be? So far it gives me the same results.

Thanks a lot for your time and patience.
