OCR: choose the best string from the last N results (adaptive filter for OCR)
I've seen several questions about choosing the best OCR result from different engines, and the answer is usually "choose the best engine". My case is different: I want to capture multiple frames of the same text image, with possible temporary occlusions or transient recognition errors. I am using tesseract-ocr with python-tesseract.
Given the OCR output from the last N frames, I want to decide which is the best result (one line of text at a time, for simplicity).
For example, for N = 3, we could use median filtering:
ABXD XBCX AXCD
When 2 of the 3 symbols at a position are identical, the majority wins, so the result is ABCD. However, this is not so easy when the strings have different lengths. If I expect a given length M (when scanning a price table, the strings are usually of the form XX.XX), I can always penalize strings longer than M.
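The per-position majority vote described above can be sketched in Python as follows. This is only a minimal illustration: the function name `majority_vote` and the `expected_len` parameter are hypothetical, and discarding readings of the wrong length outright is an assumption (a softer penalty would also be possible).

```python
from collections import Counter


def majority_vote(readings, expected_len=None):
    """Per-position majority vote over equal-length OCR readings."""
    if expected_len is not None:
        # Assumption: readings whose length differs from the expected
        # length M are simply dropped before voting.
        readings = [r for r in readings if len(r) == expected_len]
    if not readings:
        return None
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("all readings must have the same length")
    # zip(*readings) yields one tuple of characters per position;
    # the most common character at each position wins.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*readings)
    )


print(majority_vote(["ABXD", "XBCX", "AXCD"]))
```

With the three strings from the example this yields ABCD. It does not handle insertions or deletions, which is exactly the difficulty with strings of different lengths.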
If these were numbers, median filtering (as in simple background subtraction in computer vision) or adaptive least-mean-squares filtering would work reasonably well. There is also the problem of similar characters, which can look very much alike depending on the font.
I also thought about using a string distance between each pair of results. For example, select the string with the smallest sum of distances to the others.
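The smallest-sum-of-distances idea can be sketched like this, assuming Levenshtein (edit) distance as the string distance; that choice, and the names `levenshtein` and `medoid`, are mine for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def medoid(readings):
    """Return the reading with the smallest total distance to all readings."""
    if not readings:
        return None
    return min(readings,
               key=lambda r: sum(levenshtein(r, other) for other in readings))


print(medoid(["12.34", "12.84", "12.34", "1234"]))
```

Unlike per-position voting, this copes with strings of different lengths, but it can only return one of the observed readings, so it cannot reconstruct ABCD from ABXD, XBCX and AXCD. A substitution cost below 1 for visually similar characters would be one way to account for font-dependent confusions.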
Has anyone dealt with this problem before? Is there a known algorithm for this kind of problem that I should know about?