How do I search my PDF file using grep?
I followed the ideas from this thread but it didn't work. https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for a fact that "Filter" appears at least 100 times in this book.
Any ideas?
If you can actually grep a given string (one you can "see" and read on a rendered or printed PDF page) out of a PDF, even with the help of pdftotext, you should consider yourself very lucky.
First: most of the recommendations in the unix.stackexchange.com thread you linked are very ignorant (to put it most politely). Most of the answers there were clearly written by people who are not familiar with the huge variety of PDF files out there.
In your case, you first convert the file with pdftotext, sending the result to stdout.
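Before blaming grep, it can help to check whether pdftotext produced any text at all. The following is only a minimal sketch: count_matches is a hypothetical helper (it is not part of poppler) that reads the extracted text on stdin, reports when nothing usable came out, and otherwise counts occurrences of the pattern rather than matching lines, ignoring case.

```shell
# Hypothetical helper (not part of poppler): reads extracted text on stdin,
# reports whether any text came out, then counts occurrences of a pattern.
count_matches() {
    pattern=$1
    text=$(cat)    # slurp everything that arrived on stdin
    # Strip all whitespace (including form feeds) to see if anything is left.
    if [ -z "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
        echo "no text extracted"
        return 1
    fi
    # grep -o prints each match on its own line, so wc -l counts occurrences
    # instead of matching lines; -i ignores case.
    printf '%s\n' "$text" | grep -o -i -- "$pattern" | wc -l | tr -d ' '
}

# Assumed usage with the file from the question:
# pdftotext PercivalWalden.pdf - | count_matches 'Filter'
```

If this prints "no text extracted", the problem lies in the PDF itself and not in the grep pattern.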
There are many types of PDF from which pdftotext cannot extract any text at all. Possible reasons include:
- The "text" you see is not font-based. It can be a single large bitmap, created by scanning or some other production process and then embedded in a PDF file wrapper. Such a page only looks like text.
- The "text" you see is not font-based. It can be a series of small vector drawings (or small bitmaps) that only look like lines of text to our eyes and brain. There are many software applications that convert fonts into so-called "outlines". The reasons for this seemingly strange behavior could be:
  - A workaround for licensing issues (when a certain font prohibits its embedding).
  - A deliberate obstacle to text extraction.
  - An accidentally wrong setting in the PDF creation application.
- The font is embedded as a subset in the PDF file (by the PDF creation software; users usually do not have much control over the details of this operation) and uses a "custom" encoding, but the file does not provide a toUnicode table for mapping glyphs to characters. "Glyphs" are the well-defined shapes in each font that get drawn on the screen. Our eyes just see these shapes, and our brains translate them into characters without any toUnicode table. Programs such as pdftotext require a toUnicode table to translate glyphs back into characters.
You can use the command-line utility pdffonts to get a first impression of the fonts your PDF is using. Example output:
pdffonts paper-projectiris---final.pdf
name                       type         encoding       emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10             Type 1       Builtin        yes yes no      96  0
VPAFLY+CMBX12              Type 1       Builtin        yes yes no      97  0
CWAIXW+CMTI12              Type 1       Builtin        yes yes no      98  0
OBMDLT+CMR12               Type 1       Builtin        yes yes no      99  0
In this case, text extraction (and with it your method of grepping for strings) should work:
- Even though the column named uni (which tells whether a toUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not say Custom but Builtin (which means the glyph -> character mapping is provided by the font file, which is of type Type 1).
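The check described above can be roughly automated. This is a sketch under assumptions: risky_fonts is a hypothetical helper that reads pdffonts output on stdin, assumes the column layout shown in the example (two header lines, then one font per line, with font names and encoding names that contain no spaces), and prints the fonts that combine a Custom encoding with no toUnicode map.

```shell
# Hypothetical helper: reads `pdffonts` output on stdin and prints the names
# of fonts that use a Custom encoding without a toUnicode map.
risky_fonts() {
    # Skip the two header lines. Columns are counted from the right because
    # the "type" column may contain spaces (e.g. "Type 1"):
    #   $(NF-5) = encoding, $(NF-2) = uni, last two fields = object ID
    awk 'NR > 2 && NF >= 8 && $(NF-5) == "Custom" && $(NF-2) == "no" { print $1 }'
}

# Assumed usage with the file from the question:
# pdffonts PercivalWalden.pdf | risky_fonts
```

An empty result does not guarantee that extraction works (the page may contain no fonts at all, as with scanned bitmaps), but any font listed is a likely reason why grep finds nothing.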
To summarize: without access to your PDF file, it's impossible to tell why you can't "grep" for the strings you're looking for!