Nice image, but no text from OCR - why? (Python, skimage, PIL, Tesseract)

I am new to image processing, computer vision and OCR. So far I think this is an amazing field, and I'm ready to dig deeper.

Imagine I have this original image: Original page

I will resize it: resized

Then I found the regional maxima, to suppress the lighter, noisier background, and got this image: Regional Maxima

Then I send the above image through thresholding and, after processing, get this image: Thresholded. It seems to me this image is not 100% binary... if I zoom in, it shows some gray pixels inside the characters...

I thought this last image should be good enough (very good, really) for OCR, don't you think? But no text comes out of it...

My code:

#http://stackoverflow.com/questions/18813300/finding-the-coordinates-of-maxima-in-an-image
from PIL import Image
import numpy as np
from skimage import io
from skimage import img_as_float
from scipy.ndimage import gaussian_filter
from skimage.morphology import reconstruction
import pytesseract

im111 = Image.open('page.jpg')

# resize to a fixed width of 1000px, keeping the aspect ratio
basewidth = 1000
wpercent = basewidth / float(im111.size[0])
hsize = int(float(im111.size[1]) * wpercent)
image_resized = im111.resize((basewidth, hsize), Image.ANTIALIAS)
image_resized.save('page2.jpg')

# regional maxima via morphological reconstruction by dilation
image = img_as_float(io.imread('page2.jpg', as_grey=True))
image = gaussian_filter(image, 1)
seed = np.copy(image)
seed[1:-1, 1:-1] = image.min()
mask = image
dilated = reconstruction(seed, mask, method='dilation')
image = image - dilated

#io.imsave("RegionalMaxima.jpg", image)

# back to an 8-bit PIL image
im = np.array(image * 255, dtype=np.uint8)
img = Image.fromarray(im)

#img.show()
#img.save('RegionalMaximaPIL.jpg')

minima, maxima = img.getextrema()
print "------Extrema1----------" + str(minima), str(maxima)

# threshold at a quarter of the maximum; map high pixels to 255 (not to
# maxima) so the conversion to 1-bit mode is well defined
thresh = int(maxima / 4)
im1 = img.point(lambda x: 0 if x < thresh else 255, '1')
im1.save('Thresh_calculated.jpg')
#im1.show()

mini, maxi = im1.getextrema()
print "-------Extrema2(after1stThresh)---------" + str(mini), str(maxi)

im2 = im1.point(lambda x: 0 if x < 128 else 255, '1')
im2.save('Thresh_calculated+++.jpg')
im2.show()

text = pytesseract.image_to_string(im2)
print "-----TEXT------" + text


What am I doing wrong? pytesseract.image_to_string(im1) on a thresholded image should already return some text.

Another doubt: shouldn't the second getextrema() return 0 and 255? I am confused because it still reports the same values as the first call, and the image saved at the second threshold comes out all black.
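For what it's worth, here is a quick check of that doubt (my own sketch, not from the post): PIL reports mode-'1' pixels as 0/255, so after a clean threshold getextrema() should indeed give (0, 255). Note also that saving a 1-bit image as JPEG lets lossy compression reintroduce gray pixels, which may explain the "not 100% binary" observation above; PNG keeps it truly binary.

```python
from PIL import Image
import numpy as np

# tiny synthetic grayscale image standing in for the page
arr = np.array([[10, 200], [90, 250]], dtype=np.uint8)
img = Image.fromarray(arr, mode='L')

# threshold at 128 into a 1-bit image; PIL reports mode-'1' pixels as 0/255
bw = img.point(lambda x: 0 if x < 128 else 255, '1')
print(bw.getextrema())  # (0, 255)
```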

Thanks a lot for your time and help.





2 answers


Sorry, I don't speak Python, but I have some experience with tesseract from the command line. From some experiments I did a while ago, I think the sweet spot where tesseract recognizes letters best is when they are around 30-50px tall.

Following this logic, I selected a portion of your image with ImageMagick covering the words Nokia and 225. Then I resized the resulting two lines of text, plus some vertical space, to 160 pixels high, i.e. so that the letters come out around 50 pixels tall:

convert nokia.jpg -crop 1000x800+1800+1000 -resize x160 x.jpg


(cropped and resized image showing the words NOKIA and 225)
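In Python terms, a rough PIL equivalent of that ImageMagick command might look like this (sketched on a synthetic image, since I don't have the original):

```python
from PIL import Image

# stand-in for the original photo (the real command used nokia.jpg)
img = Image.new('L', (3000, 2000), 255)

# -crop 1000x800+1800+1000: a 1000x800 box whose top-left corner is (1800, 1000)
region = img.crop((1800, 1000, 2800, 1800))

# -resize x160: scale to 160px tall, preserving the aspect ratio
h = 160
w = int(region.width * h / region.height)
small = region.resize((w, h))
print(small.size)  # (200, 160)
```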

Then I ran tesseract like this and looked at the recognized text:



tesseract x.jpg text
Tesseract Open Source OCR Engine v3.02.02 with Leptonica

more text*
NOKIA
225


I'm not pretending this is a miracle solution. I'm just saying that I would crop out some of the text - perhaps using "Connected Component Analysis" (or similar) - resize it so the text is about 30-80px tall, and see how that does.
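The connected-component idea might look something like this with scipy.ndimage (a toy illustration on a synthetic binary image, not the poster's actual pipeline):

```python
import numpy as np
from scipy import ndimage

# toy binary image: two blobs standing in for text regions
binary = np.zeros((10, 10), dtype=bool)
binary[2:5, 1:4] = True
binary[6:9, 5:9] = True

# label each connected blob, then read off its bounding box
labels, n = ndimage.label(binary)
for box in ndimage.find_objects(labels):
    # each box is a (row-slice, col-slice) pair you could crop and
    # resize to the 30-80px sweet spot before running tesseract
    print(box)
```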

Feel free to ask any questions in the comments and I will see what I can do - or maybe some other smart people will know more and chip in their thoughts ...

I had a moment to do some more experimenting, so I tried to find a sweet spot for the overall height of your image to help tesseract succeed. I varied the height from 100 to 500px in increments of 10 and then checked the resulting OCR output for the target words like this:

for x in $(seq 100 10 500); do 
  convert nokia.jpg -resize x$x small.jpg
  echo Height:$x
  tesseract small.jpg text >/dev/null 2>&1 && grep -E "NOKIA|225" text*
done

Height:100
Height:110
Height:120
Height:130
Height:140
NOKIA
225
Height:150
Height:160
Height:170
225
Height:180
Height:190
NOKIA
225
Height:200
225
Height:210
NOKIA
225
Height:220
NOKIA
225
Height:230
NOKIA
225
Height:240
NOKIA
225
Height:250
NOKIA
225
Height:260
Height:270
NOKIA
225
Height:280
Height:290
NOKIA
225
Height:300
NOKIA
225
Height:310
Height:320
NOKIA
225
Height:330
Height:340
Height:350
NOKIA
225
Height:360
NOKIA
225
Height:370
NOKIA
225
Height:380
NOKIA
225
Height:390
Height:400
Height:410
Height:420
Height:430
Height:440
Height:450
NOKIA
225
Height:460
Height:470
Height:480
Height:490
Height:500
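The same sweep could be sketched in Python; height_sweep and its ocr parameter are my own names, and in real use you would pass pytesseract.image_to_string (the stub below just lets the sketch run standalone):

```python
from PIL import Image

def height_sweep(img, ocr, heights=range(100, 501, 10), words=('NOKIA', '225')):
    """Resize img to each height (keeping aspect ratio), run ocr on it,
    and record which target words were recognized at that height."""
    results = {}
    for h in heights:
        w = max(1, int(img.width * h / img.height))
        text = ocr(img.resize((w, h)))
        results[h] = [wd for wd in words if wd in text]
    return results

# stub OCR so the sketch runs standalone; real use: ocr=pytesseract.image_to_string
img = Image.new('L', (400, 200), 255)
res = height_sweep(img, ocr=lambda im: 'NOKIA 225' if im.height >= 200 else '')
print(res[100], res[200])  # [] ['NOKIA', '225']
```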






I found that Tesseract sometimes has issues with JPGs, but works fine on PNGs of the same image. So I convert the file to PNG and read that instead.
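A minimal sketch of that conversion with PIL (the file name is hypothetical, and the image is synthetic so the snippet runs standalone):

```python
from PIL import Image

# re-save the image as PNG and OCR the PNG instead of the JPG;
# synthetic image here - the real code would open its own JPG
img = Image.new('L', (100, 40), 255)
img.save('page.png')

reopened = Image.open('page.png')
print(reopened.format, reopened.size)  # PNG (100, 40)
```

Then pytesseract.image_to_string(reopened) as in the question's code.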











