How does OCR work in Google Drive?

I noticed that Google Drive recognizes text in PDFs (and in other files such as images and text documents). Out of curiosity, I would like to know how they make the page images selectable and searchable. For example, when I inspect a Google Drive document in Chrome developer tools, each page is an image, yet it doesn't behave like an image because the text is selectable. Also, when I zoom in, a different, higher-resolution image appears to be loaded. I think this is the same trick that Scribd uses.

I have also read that Google works on improving tesseract-ocr and that the Google Books team helped implement OCR in Google Drive, but I'm not sure how they generate the img tags the way they do.

What's going on behind the scenes?

Thanks!

2 answers


I can't be sure exactly what is going on, but I can make some inferences. If you look at the HTML of the viewer page for a PDF stored in your Drive, you will find something like this:

<div id="page-pane" class="">
   <div id=":2h.page.0" class="page-element goog-inline-block" style="width: 820px;">
      <div>
         <div class="highlight-pane"></div>
         <div class="highlight-pane">
            <div class="highlight selection-highlight" style="left: 154px; top: 142px; width: 268px; height: 13px;"></div>
            <div class="highlight selection-highlight" style="left: 105px; top: 164px; width: 73px; height: 14px;"></div>
            <div class="highlight selection-highlight" style="left: 154px; top: 181px; width: 128px; height: 13px;"></div>
         </div>
         <div class="highlight-pane"></div>
         <div class="highlight-pane"></div>
         <img class="page-image" style="width: 800px; height: 1131px; display: none;" src="https://docs.google.com/file/d/0BzxfQAgMGNM6VGg4RFlBZkdoOWM/image?pagenumber=1&amp;w=138" />
         <img class="page-image" style="width: 800px;" src="https://docs.google.com/file/d/0BzxfQAgMGNM6VGg4RFlBZkdoOWM/image?pagenumber=1&amp;w=800" />
         <p id=":2h.a11y.0" class="accessibility-text" tabindex="-1"></p>
      </div>
   </div>
</div>

There are four highlight-pane divs and the page-image img elements inside :2h.page.0 (page 0 of the PDF). The visible img element shows the image you are talking about. It is just a plain image, with no OCR involved. The selectable text you mentioned lives in the second highlight-pane div, to which elements are added dynamically when you drag a box over the image. The three highlight divs inside that second pane represent the selected text (here, three lines of selected text).
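On the zoom behaviour from the question: the two img elements in the snippet point at the same page and differ only in the w query parameter. A minimal Python sketch that parses the two src values copied from the snippet makes this visible. Reading w as "requested width in pixels" is my assumption; it is not documented anywhere I know of.

```python
from html import unescape
from urllib.parse import urlsplit, parse_qs

# The two src attributes copied from the snippet (still HTML-escaped).
SRCS = [
    "https://docs.google.com/file/d/0BzxfQAgMGNM6VGg4RFlBZkdoOWM/image?pagenumber=1&amp;w=138",
    "https://docs.google.com/file/d/0BzxfQAgMGNM6VGg4RFlBZkdoOWM/image?pagenumber=1&amp;w=800",
]


def image_params(src: str) -> dict:
    """Return the query parameters of a page-image URL as a plain dict."""
    query = urlsplit(unescape(src)).query
    # parse_qs returns a list per key; each key appears only once here.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = [image_params(src) for src in SRCS]
# Assumption: "w" is the requested image width in pixels.
widths = [int(p["w"]) for p in params]
print(params)
print(widths)
```

If that reading is right, zooming in would only require the viewer to request the same page again with a larger w.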



When you visit the page, the following happens:

  • A page image rendered from the PDF stored in your Drive is displayed.
  • You select something on the page by dragging a box.
  • The selection runs JavaScript that triggers OCR on the PDF (the OCR output may already have been computed).
  • The OCR output is added as a div inside a highlight-pane div.
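The steps above can be sketched as follows. This is only a guess at the logic, written in Python rather than the viewer's JavaScript, and it assumes the OCR result is already available as word bounding boxes in page-image pixels. The Box and Word types, the select function and the sample words are all hypothetical. I chose the sample boxes so that the output reproduces the first two highlight styles from the snippet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height

    def intersects(self, other: "Box") -> bool:
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass(frozen=True)
class Word:
    text: str
    line: int  # index of the text line assigned by the (assumed) OCR step
    box: Box


def select(words, drag):
    """Return (selected_text, highlight_styles) for a drag box."""
    by_line = {}
    for word in words:
        if word.box.intersects(drag):
            by_line.setdefault(word.line, []).append(word)

    texts, styles = [], []
    for line in sorted(by_line):
        hits = sorted(by_line[line], key=lambda w: w.box.left)
        # Merge all hit words on one line into a single rectangle,
        # which becomes one highlight div per selected line.
        left = min(w.box.left for w in hits)
        top = min(w.box.top for w in hits)
        right = max(w.box.right for w in hits)
        bottom = max(w.box.bottom for w in hits)
        texts.append(" ".join(w.text for w in hits))
        styles.append(f"left: {left}px; top: {top}px; "
                      f"width: {right - left}px; height: {bottom - top}px;")
    return "\n".join(texts), styles


# Hypothetical OCR output for one page.
words = [
    Word("Hello", 0, Box(154, 142, 100, 13)),
    Word("world", 0, Box(260, 142, 162, 13)),
    Word("again", 1, Box(105, 164, 73, 14)),
    Word("skip", 2, Box(500, 400, 50, 13)),
]
text, styles = select(words, Box(100, 140, 330, 40))
print(text)
print(styles)
```

Each returned style string would be set on one absolutely positioned highlight div layered over the page image, which is why the text looks selectable even though the page itself is only an image.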


There are two main methods for OCR: matrix matching and feature extraction. Of the two, matrix matching is the simpler and more common.

Matrix matching compares what the OCR scanner sees as a character with a library of character matrices, or templates. When an image matches one of these prescribed dot matrices within a given level of similarity, the computer labels that image as the corresponding ASCII character.
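As a toy illustration of matrix matching, here is a minimal Python sketch. The 3x5 templates, the pixel-agreement score and the 0.8 threshold are all made up for the example; real engines use far larger template libraries and more robust similarity measures.

```python
# Hypothetical 3x5 dot-matrix templates: "#" is ink, "." is background.
TEMPLATES = {
    "I": ("###",
          ".#.",
          ".#.",
          ".#.",
          "###"),
    "L": ("#..",
          "#..",
          "#..",
          "#..",
          "###"),
    "T": ("###",
          ".#.",
          ".#.",
          ".#.",
          ".#."),
}


def similarity(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    cells = [(g, t)
             for glyph_row, template_row in zip(glyph, template)
             for g, t in zip(glyph_row, template_row)]
    return sum(g == t for g, t in cells) / len(cells)


def match(glyph, threshold=0.8):
    """Return the best-matching character, or None if nothing is close enough."""
    best_char, best_score = max(
        ((char, similarity(glyph, template)) for char, template in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    return best_char if best_score >= threshold else None


# An "L" with one noisy pixel still matches; a blank cell matches nothing.
noisy_l = ("#..",
           "#..",
           "#.#",
           "#..",
           "###")
blank = ("...",) * 5
print(match(noisy_l))
print(match(blank))
```

The threshold plays the role of the "given level of similarity": set it too high and noisy scans are rejected, too low and different characters get confused.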

Feature extraction is OCR without strict matching against prescribed templates. Also known as Intelligent Character Recognition (ICR) or Topological Feature Analysis, this method varies in how much "computer intelligence" the manufacturer applies. The computer looks for general features such as open areas, closed shapes, diagonal lines, line intersections, etc. This method is much more versatile than matrix matching. Matrix matching works best when the OCR encounters a limited repertoire of type styles, with little or no variation within each style. Where the characters are less predictable, feature (topological) analysis is superior.
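To make "closed shapes" concrete, here is a small Python sketch that extracts a single topological feature, the number of enclosed holes in a binarized glyph, using a flood fill. The function name count_holes and the sample glyphs are hypothetical. A real ICR engine would combine many such features before classifying a character.

```python
def count_holes(glyph):
    """Count background regions fully enclosed by ink ("#") in a glyph."""
    height, width = len(glyph), len(glyph[0])
    # Pad with a one-pixel background border so all outside background is connected.
    grid = ([[False] * (width + 2)]
            + [[False] + [ch == "#" for ch in row] + [False] for row in glyph]
            + [[False] * (width + 2)])
    rows, cols = height + 2, width + 2
    seen = [[False] * cols for _ in range(rows)]

    def flood(r0, c0):
        # Iterative 4-connected flood fill over background pixels.
        stack = [(r0, c0)]
        seen[r0][c0] = True
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < rows and 0 <= nc < cols
                        and not grid[nr][nc] and not seen[nr][nc]):
                    seen[nr][nc] = True
                    stack.append((nr, nc))

    flood(0, 0)  # mark the outside background first
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] and not seen[r][c]:
                holes += 1  # unreached background means an enclosed region
                flood(r, c)
    return holes


small_o = ("###", "#.#", "#.#", "#.#", "###")
big_o = ("#####", "#...#", "#...#", "#...#", "#####")
eight = ("###", "#.#", "###", "#.#", "###")
letter_c = ("###", "#..", "#..", "#..", "###")
for name, glyph in [("small_o", small_o), ("big_o", big_o),
                    ("eight", eight), ("letter_c", letter_c)]:
    print(name, count_holes(glyph))
```

Both O glyphs give the same feature value although they differ in size, which is the kind of robustness to type-style variation that a fixed template lacks.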



If you want to know more, see: http://www.dataid.com/aboutocr.htm
