How to reduce the size of a PDF file generated by tesseract?

My (network) application setup is like this: users upload PDF files, I run OCR on them, and I show them the OCR'd PDF. Since everything happens online, minimizing the size of the resulting PDF is key to reducing download and waiting times for the user.

The file I get from the user is sample.pdf (I created an archive with the source files as well as the ones I create here: https://dl.dropboxusercontent.com/u/1390155/tess-files/sample.zip ). I am using tesseract 3.04 and am doing the following:

gs -r300 -sDEVICE=tiff24nc -dBATCH -dNOPAUSE -sOutputFile=sample.tiff sample.pdf
tesseract sample.tiff sample-tess -l fra -psm 1 pdf


The OCR result is good, but the generated PDF is now about 2.5 times larger:

  • original pdf file size: 60k
  • final pdf size: 147K

So my question is: how can I reduce the size of the generated PDF while keeping the OCR output?

One obvious solution is to reduce the resolution when generating the tiff, but I don't want to do that as it might affect the OCR result.

The second thing I tried was to reduce the size of the PDF after tesseract using ghostscript:

gs -o sample-down-300.pdf   -sDEVICE=pdfwrite   -dDownsampleColorImages=true \
   -dDownsampleGrayImages=true   -dDownsampleMonoImages=true  \
   -dColorImageResolution=300   -dGrayImageResolution=300  \
   -dMonoImageResolution=300   -dColorImageDownsampleThreshold=1.0  \
   -dGrayImageDownsampleThreshold=1.5   -dMonoImageDownsampleThreshold=1.0 \
    sample-tess.pdf 


It helps a little: the generated file is 101 KB, about 1.5 times the size of the original. I could live with that, but it also seems to affect the OCR result. For example, there is no longer a space between "RESTAURANT" and "PIZZERIA" (second line).

Another (simpler) Ghostscript option, using the /ebook preset, produces a 43 K file with lower image quality and the same missing-space issue:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dBATCH  -dQUIET -sOutputFile=sample-ebook.pdf \
     sample-tess.pdf


The lower image quality is fine, but again, I really don't want to compromise the OCR.

I've run other tests using PNG and JPEG, but the OCR results always get slightly worse and the resulting PDF is no smaller. For example, with PNG:

convert -density 300 sample.pdf -transparent white sample.png
tesseract sample.png sample-tess-png -l fra -psm 1 pdf


This time the amount (55.50) is missing from the OCR output, and the final PDF is 149 K.

So, to summarize, here are my questions:

  • Can someone explain why reducing the size of the PDF with Ghostscript affects the OCR result? I thought the text layer and the image layer were independent...
  • Does tesseract offer any options to reduce the quality of the images when it generates the PDF?
  • I've read that other OCR solutions like ABBYY use Mixed Raster Content (MRC) to reduce file size. Does Tesseract already do this? If not, is there any open-source or CLI tool that does, which I could use to shrink the tesseract-generated PDF?

Again, I'm OK with compromising the quality of the PDF images (although ideally I would like to preserve the colors) as long as the user can still search the text and select/copy/paste from the PDF.

Any help is greatly appreciated!



3 answers


First problem: I can't see any file attached to this question, so I'm guessing in the dark.

There is no "text layer" or "image layer" in PDF. PDF can have layers, but they are an independent feature; text and images are simply inserted into the file as they are. Of course, rendering a PDF page to a TIFF produces a single image.

The original PDF stores the text as text, using fonts; in the TIFF file everything is rasterized into an image. I'm not sure exactly how tesseract works, and without an example of its output I can't be certain, but I expect it leaves the rendered image intact in the output PDF and adds the text using render mode 3 (neither stroke nor fill, i.e. invisible). This is what you described as "MRC" above.
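As an illustration, the "invisible text" technique rests on the text rendering mode operator in the page's content stream. A minimal sketch (the content-stream fragment is illustrative, and the qpdf step is an assumption about how to inspect the Tesseract output, not something from this post):

```shell
# "3 Tr" in a PDF content stream sets text render mode 3: glyphs are
# positioned (so they can be searched and selected) but never painted.
stream='BT /f-0-0 8 Tf 3 Tr (RESTAURANT PIZZERIA) Tj ET'
case "$stream" in
  *"3 Tr"*) mode=invisible ;;
  *)        mode=visible   ;;
esac
echo "$mode"

# To check a real file, decompress its streams first (assumes qpdf is
# installed and sample-tess.pdf is the Tesseract output from the question):
if command -v qpdf >/dev/null 2>&1 && [ -f sample-tess.pdf ]; then
  qpdf --qdf --object-streams=disable sample-tess.pdf sample-qdf.pdf
  grep -c ' 3 Tr' sample-qdf.pdf   # count invisible-text operators
fi
```

If the grep finds `3 Tr` operators, the text is there but deliberately unpainted, which is exactly why you can copy/paste from a page that looks like a scan.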

This means the original PDF is small because much (possibly all) of its content is described as vector data. The intermediate TIFF is large because it is a full-page bitmap; the savings of the vector representation are lost. It is then converted to PDF (so still big), and then text and fonts are added, which of course increases the size further.
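A back-of-the-envelope calculation makes the size jump concrete (A4 page dimensions are assumed here; the question doesn't state the page size):

```shell
# An A4 page (8.27 x 11.69 in) rendered at 300 dpi in 24-bit colour
# (gs -sDEVICE=tiff24nc -r300) is roughly 2481 x 3507 pixels:
width=$((300 * 827 / 100))     # 2481 px
height=$((300 * 1169 / 100))   # 3507 px
bytes=$((width * height * 3))  # 3 bytes/pixel (RGB)
echo "$bytes"                  # ~26 MB before any compression
```

Even after image compression inside the TIFF and the output PDF, that bitmap is what gets embedded, which is why the 60 K vector original balloons to 147 K.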



The only thing that will significantly affect the size of this file is reducing the size of the bitmap, i.e. the TIFF file you use to create the final output PDF.

Converting the original PDF to TIFF and running OCR on it is, by itself, unlikely to affect the final PDF file size (caveat: compression may work better, because there may be more areas of "flat" colour).

Without seeing the original and final files I can't say more, and I can't test it myself (I don't have Tesseract), but it seems to me the only real solution is for Tesseract to scale down the image before creating the final output PDF.



Since you are using Tesseract 3.04, it supports various compression modes which you can look into:

  --force-transcode=[true|false]
  --force-lossless=[true|false]
  --force-compression-algorithms=[dct|flate|g4|lzw|jpx|jbig2]




See issues 1285 and 1300.



First, Tesseract is an OCR engine. You cannot expect any feature it has beyond OCR to be optimized; OCR is what it is very good at, and nothing else. It does do other things; for example, it binarizes whatever image you give it (using Otsu's method) if it isn't binarized already. But you would get better results by binarizing the image yourself first and then passing it to Tesseract, assuming you have an idea of what you are giving it.
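A sketch of such a pre-binarization pass, assuming ImageMagick 7 is available (`-auto-threshold OTSU` is an IM7 option; the file names are taken from the question's PNG pipeline):

```shell
# Binarize with Otsu's method ourselves, instead of relying on
# Tesseract's internal thresholding, then feed the result to tesseract.
if command -v magick >/dev/null 2>&1 && [ -f sample.png ]; then
  magick sample.png -colorspace Gray -auto-threshold OTSU sample-bin.png
  # tesseract sample-bin.png sample-tess-bin -l fra -psm 1 pdf
  result=binarized
else
  result=skipped   # ImageMagick 7 or the input file is not available
fi
echo "$result"
```

A 1-bit binarized page also compresses far better (G4/JBIG2) than 24-bit colour, though you lose the colours you said you'd ideally like to keep.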

None of this is a Tesseract problem. The reason the spacing changes is that the PDF viewer guesses at word/line breaks, since they are not encoded in the file. If the text is the same and the spacing is broken, it is entirely a PDF-viewing issue. The reason it changes between your PDFs is that resizing the resolution/canvas interferes with the viewer's word/line calculations. For comparison, you can view the content object of any page in Adobe Acrobat under Preflight | Options | Browse Internal PDF Structure.

The first question I would ask is: why are the images in the PDF changed at all? They shouldn't be; they should be exactly the same images you started with, only with the text layer (yes, a text layer: it's text, and it's a layer on top of the image) placed invisibly above them. You can use Browse Internal PDF Structure (or even Notepad) to look at the size of the image objects and check whether they are the same. If they aren't, you may want to keep the original images and substitute them back into the final PDF.
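One way to make that comparison outside Acrobat, assuming poppler-utils is installed (`pdfimages -list` reports each embedded image's dimensions, bit depth, and encoding; file names are from the question):

```shell
# Compare the embedded images of the original and the OCR'd PDF; if the
# listed width/height/enc columns differ, the images were re-encoded.
if command -v pdfimages >/dev/null 2>&1 \
   && [ -f sample.pdf ] && [ -f sample-tess.pdf ]; then
  pdfimages -list sample.pdf
  pdfimages -list sample-tess.pdf
  result=listed
else
  result=skipped   # poppler-utils or the sample files are not available
fi
echo "$result"
```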

Second, the text may not be compressed. PDF supports Flate (deflate) compression; no doubt there is an option in Ghostscript or PDFTK to compress all the content streams.
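A hedged sketch of recompressing the output's streams (both are standard tool invocations; whether they shrink this particular file is untested here):

```shell
# Rewrite the Tesseract output with all streams Flate-compressed.
# qpdf can also pack objects into object streams for extra savings;
# Ghostscript's pdfwrite recompresses streams when rewriting a PDF.
if command -v qpdf >/dev/null 2>&1 && [ -f sample-tess.pdf ]; then
  qpdf --compress-streams=y --object-streams=generate \
       sample-tess.pdf sample-compressed.pdf
  result=qpdf
elif command -v gs >/dev/null 2>&1 && [ -f sample-tess.pdf ]; then
  gs -o sample-compressed.pdf -sDEVICE=pdfwrite sample-tess.pdf
  result=gs
else
  result=skipped
fi
echo "$result"
```

Unlike the downsampling runs in the question, this touches only the encoding of the streams, not the image pixels or the text, so it should not change the OCR layer at all.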

Finally, you shouldn't have to reduce the quality of the images in the PDF at all. If I were one of your users/customers, I don't think I would be happy to get back something different from what I gave you; it would make your service useless to me.


