How do I search my PDF file using grep?

I followed the ideas from this thread but it didn't work. https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files

 pdftotext PercivalWalden.pdf - | grep 'Slepian'
 pdftotext PercivalWalden.pdf - | grep 'Naive'
 pdftotext PercivalWalden.pdf - | grep 'Filter'

      

I know for a fact that the word "Filter" appears at least 100 times in this book.

Any ideas?



1 answer


If you can indeed grep a given string (one you can "see" and read on a rendered or printed PDF page) out of a PDF, even with the help of pdftotext, you should consider yourself very lucky.

First: most of the recommendations in the unix.stackexchange.com thread you linked are poorly informed (to put it politely). Most of the answers there were clearly written by people who are not familiar with the huge range of PDF variants in the wild.

In your case, you first try to convert the file with pdftotext, passing the result to stdout.

There are many types of PDF from which pdftotext cannot extract any text at all. Possible reasons:

  • The "text" you see is not font-based. It can be a single large bitmap created by scanning or some other production process and then embedded in a PDF file wrapper. The page then only looks like text.

  • The "text" you see is not font-based. It can be a series of small vector drawings (or small bitmaps) that only look like lines of text to our eyes and brain.

    There are many software applications that convert fonts into so-called "outlines". The reason for this seemingly strange behavior could be:

    • A workaround for licensing issues (when a certain font prohibits embedding).
    • A deliberate obstacle against attempts to extract the text.
    • An accidentally wrong setting in the PDF creation application.
       
  • The font is embedded as a subset in the PDF file (by the PDF creation software; users usually have little control over the details of this operation) and uses a "custom" encoding, but the file does not provide a ToUnicode table for mapping glyphs to characters.

    "Glyphs" are the well-defined shapes in each font that get drawn on the screen. Glyphs are what the computer works with. Our eyes just see these shapes, and our brains translate them into characters without any ToUnicode table. Programs such as pdftotext do require a ToUnicode table to translate glyphs back into characters.
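A quick way to tell which situation you are in is to look at what pdftotext actually emits before blaming grep. The following is only a minimal sketch: the function names has_text and count_hits are made up for this illustration, and both read the already extracted text from stdin, so they make no assumptions about pdftotext beyond the "-" (stdout) usage shown in the question.

```shell
# Succeeds (exit status 0) if stdin contains at least one non-whitespace
# character. Page-break form feeds count as whitespace, so output that
# consists only of page breaks is treated as "no text".
has_text() {
  grep -q '[^[:space:]]'
}

# Prints how many times the pattern given as $1 occurs on stdin,
# ignoring case. grep -o prints every match on its own line, so
# repeated hits within one line are all counted.
count_hits() {
  grep -o -i -- "$1" | wc -l | tr -d '[:space:]'
}
```

Typical use would be `pdftotext PercivalWalden.pdf - | has_text || echo "no extractable text"` followed by `pdftotext PercivalWalden.pdf - | count_hits 'Filter'`. If no text comes out at all, the file probably falls under one of the first two cases above (bitmap or outlines). If text comes out but looks like gibberish, the third case is the more likely suspect.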




You can use the command-line utility pdffonts to get a first impression of the fonts your PDF uses. Example output:

pdffonts paper-projectiris---final.pdf 

 name                       type         encoding       emb sub uni object ID
 -------------------------- ------------ -------------- --- --- --- ---------
 TCQJEF+CMCSC10             Type 1       Builtin        yes yes no      96  0
 VPAFLY+CMBX12              Type 1       Builtin        yes yes no      97  0
 CWAIXW+CMTI12              Type 1       Builtin        yes yes no      98  0
 OBMDLT+CMR12               Type 1       Builtin        yes yes no      99  0

      

In this case, text extraction (and your method of grepping for strings) should work:

  • Even though the column named uni (which tells whether a ToUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not contain Custom, but Builtin (which means that the glyph -> character mapping is provided with the font file, which is of type Type 1).
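The check described above can be sketched as a small filter over pdffonts output. Treat it as a heuristic, not a verdict: it only flags fonts whose encoding column reads Custom while the uni column reads no. The function name risky_fonts is made up for this illustration. Columns are counted from the right because the type column can itself contain spaces (as in "Type 1").

```shell
# Reads pdffonts output on stdin and prints the name of every font
# that uses a Custom encoding and has no ToUnicode map (uni = no).
# Fields counted from the end of the line:
#   $(NF-5) = encoding, $(NF-2) = uni, last two fields = object ID.
# The header and separator lines never match, so they are skipped.
risky_fonts() {
  awk 'NF >= 7 && $(NF-5) == "Custom" && $(NF-2) == "no" { print $1 }'
}
```

Run it as `pdffonts PercivalWalden.pdf | risky_fonts`. Empty output suggests that none of the fonts falls under the "custom encoding without a ToUnicode table" case, although the bitmap and outline cases remain possible (such pages typically show no fonts at all).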



To summarize: without access to your PDF file, it is impossible to tell why you cannot grep for the strings you are looking for.
