Image processing to improve Tesseract OCR

I am using Tesseract to convert documents to text. Document quality varies widely, and I'm looking for advice on what image processing could improve the results. I've noticed that highly pixelated text, such as that produced by fax machines, is especially difficult for Tesseract to handle; presumably the jagged character edges confuse its shape-recognition algorithms.

What image processing techniques will improve accuracy? I used a Gaussian blur to smooth the pixelated images and saw a slight improvement, but I'm hoping there is a more targeted method that gives better results: for example, a filter tuned to black-and-white images that smooths out irregular edges, followed by a filter that increases contrast to make the characters sharper.
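To make the idea concrete, here is a rough sketch of the two-step pipeline I have in mind: smooth first, then push every pixel back to pure black or white. It is plain Python on a nested list of grayscale values (0 = black, 255 = white) purely for illustration; a real pipeline would use an image library. The function names, the radius of 1, and the cutoff of 128 are my own placeholder choices, and a simple box blur stands in for the Gaussian blur I actually used.

```python
def box_blur(img, radius=1):
    """Replace each pixel with the mean of its neighbourhood.

    img is a list of rows of grayscale ints (0..255). Coordinates that
    fall outside the image are clamped to the nearest edge pixel.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            count = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny = min(max(y + dy, 0), h - 1)
                    nx = min(max(x + dx, 0), w - 1)
                    total += img[ny][nx]
                    count += 1
            out[y][x] = total // count
    return out


def threshold(img, cutoff=128):
    """Map every pixel to pure white (>= cutoff) or pure black (< cutoff)."""
    return [[255 if p >= cutoff else 0 for p in row] for row in img]


def smooth_then_binarize(img, radius=1, cutoff=128):
    """Hypothetical pipeline: blur away jagged pixels, then restore contrast."""
    return threshold(box_blur(img, radius), cutoff)


if __name__ == "__main__":
    # A white page with one stray black pixel, like fax noise.
    page = [[255] * 5 for _ in range(5)]
    page[2][2] = 0
    for row in smooth_then_binarize(page):
        print(row)
```

In this toy case the isolated speck is averaged with its white neighbours and then thresholded away, while a solid block of black pixels would survive. Whether the same approach helps on real fax scans is exactly what I am unsure about.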

Any general tips for anyone new to image processing?
