Tesseract OCR gives poor output

I am using the C# wrapper for the Tesseract library (3.02, if I'm not mistaken): https://github.com/charlesw/tesseract . It runs and returns results, but the output is essentially garbage: often it gives nothing, and when it does give something, it's usually a mess. I know it works in principle, because I've tried it on some really clean images and it reads them fine. I'm wondering if anyone can help me diagnose the problem and suggest ways to improve Tesseract's accuracy. I have already converted all the images to black and white, and the resolution is set to 300 DPI. The images themselves, as you can see below, are pretty straightforward.

[image] This image works fine.

[image] This one doesn't work at all: it gives either gibberish or nothing.

I tried inverting the colors, thinking it might give more contrast (since most of the text is black on a white background, while the problem areas are white on a black background). But:

[image] doesn't work at all, whereas

[image] works great again.

I suspect this is due to the extra letter spacing in INVOICE, but there must be some way to get decent results with a heavier font. Any suggestions are welcome; I'm a relative novice here.

1 answer


If possible, you should consider using higher-resolution images. Another likely problem with the Lockheed Martin and Payments images is that the gaps between the letters are too small: Tesseract cannot segment single letters if they are (almost) connected to the next letter of the word. I would suggest an image-processing library like OpenCV to improve your results. You can try erosion/dilation; this will separate the letters if the right parameters are used for the kernel. Try different kernels to see which works best for you.

// erosion_type: MORPH_RECT, MORPH_ELLIPSE or MORPH_CROSS
int erosion_type = MORPH_RECT;
// Kernel radius; start small and tune per image.
int erosion_size = 1;

Mat element = getStructuringElement( erosion_type,
                                     Size( 2*erosion_size + 1, 2*erosion_size + 1 ),
                                     Point( erosion_size, erosion_size ) );

erode( src, erosion_dst, element );
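Since the best kernel is image-dependent, a quick way to compare them is to run the same erosion once per built-in structuring-element shape and inspect the outputs side by side. A minimal sketch, assuming the OpenCV 3.x Java bindings used in the snippets below; the file names are placeholders:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class KernelComparison {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        Mat src = Imgcodecs.imread("invoice.png", Imgcodecs.IMREAD_GRAYSCALE);

        // Erode with each built-in kernel shape at the same size and save
        // the results for visual comparison.
        int[] shapes = { Imgproc.MORPH_RECT, Imgproc.MORPH_ELLIPSE, Imgproc.MORPH_CROSS };
        String[] names = { "rect", "ellipse", "cross" };
        for (int i = 0; i < shapes.length; i++) {
            Mat kernel = Imgproc.getStructuringElement(shapes[i], new Size(3, 3));
            Mat dst = new Mat();
            Imgproc.erode(src, dst, kernel);
            Imgcodecs.imwrite("eroded_" + names[i] + ".png", dst);
        }
    }
}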

What helped me a lot when I was working on my own project was using an adaptive threshold. I found this more effective than a plain grayscale or binary conversion. Note: the code below is Java, but it should look very similar in C++.

// cropedIm is a single-channel (grayscale) Mat of the cropped region
Imgproc.adaptiveThreshold(cropedIm, cropedIm, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY, 29, 10);
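Since cropedIm is not defined above, here is a self-contained version of that call as a minimal sketch, assuming the OpenCV 3.x Java bindings; the file name is a placeholder, and the block size (29) and constant (10) are starting points to tune:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class AdaptiveThresholdDemo {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // Load the scan directly as a single-channel grayscale image.
        Mat im = Imgcodecs.imread("invoice.png", Imgcodecs.IMREAD_GRAYSCALE);

        // Threshold each pixel against a Gaussian-weighted mean of its
        // 29x29 neighborhood, minus a constant of 10.
        Imgproc.adaptiveThreshold(im, im, 255,
                Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C,
                Imgproc.THRESH_BINARY, 29, 10);

        Imgcodecs.imwrite("invoice_thresholded.png", im);
    }
}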

This is what I get after running one of your images through Pixtern, my Android project (source code on GitHub). I used an adaptive threshold but no dilation or erosion, and the result is already pretty good.

[image] After applying the adaptive threshold

[image] Result (image: Lockheed Martin Aeronautics)

For the Payments image and similar ones: try a normal threshold and then invert the image (black font, white background). Again, dilation/erosion can be applied afterwards; see the sketch after the snippet. Java code:

// results in a binary image
Imgproc.threshold(cropedIm, cropedIm, 127, 255, Imgproc.THRESH_BINARY);
// invert the image
Core.bitwise_not(cropedIm, cropedIm);
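Putting those steps together, with the optional morphology pass afterwards. A minimal sketch under the same OpenCV 3.x Java assumption; the file names, the 127 threshold, and the 3x3 kernel are placeholder values to tune:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class InvertForOcr {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        Mat im = Imgcodecs.imread("payments.png", Imgcodecs.IMREAD_GRAYSCALE);

        // Global threshold -> binary image (here: white text on black).
        Imgproc.threshold(im, im, 127, 255, Imgproc.THRESH_BINARY);

        // Invert so the text becomes black on a white background.
        Core.bitwise_not(im, im);

        // Optional: dilating a black-on-white image thins the dark strokes,
        // which widens the gaps between nearly-touching letters.
        Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
        Imgproc.dilate(im, im, kernel);

        Imgcodecs.imwrite("payments_for_tesseract.png", im);
    }
}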