Parsing PDF - Extract One Page

Question

I recently wrote a Python program that read a PDF, took some commands from the user, and output some or all of the original PDF with the pages in a different order, or with only the pages of interest selected. There was a great library for this at the time, PyPDF2. It did all the heavy lifting.

Now I am working in another language (Haskell) that has little or no PDF support that I can find. I am considering writing my own personal library. However, when looking at the contents of a PDF file, I am having a hard time determining where particular pages are. I can tell how many pages there are in the file, but I cannot look at a specific part of the file and say, "This is page X of Y." So how can I pick out content based on pages? How can I split a file by page if I don't know which page a piece of content is on?

pdf pdf-generation

Michael G · June 13 '15 at 3:44

1 answer

David van Driessche · Accepted Answer · 2015-06-13T17:31:13+0000

The first thing you need is a copy of the PDF specification. You can download this for free from the Adobe website here: http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

In this document see section 7.7.3 for an explanation of how the Page Tree works.

Basically, a PDF file contains a tree (Adobe recommends a balanced tree, but it doesn't have to be one) that starts with a root Pages object, optionally contains any number of intermediate Pages objects, and ends in Page objects as its leaves. For example:

Pages
. Pages
  . Page (1)
  . Page (2)
  . Page (3)
. Pages
  . Pages
    . Page (4)
    . Page (5)
  . Pages
    . Page (6)
    . Page (7)

The number of levels in this tree is not limited. To find a given page, you have to traverse the tree from the root in order, assigning page numbers as you encounter the Page leaf objects. In the example above, I have marked the page number each leaf ends up with (counting from 1).
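The traversal described above can be sketched in a few lines. This is only a toy model, not a parser: plain dicts stand in for PDF objects that have already been read from the file, the `label` field and the helper names (`pages`, `page`, `iter_pages`, `find_page`) are made up for the illustration, and a real file stores the children under `/Kids` as indirect references that have to be resolved first. The same recursion carries over directly to Haskell or any other language.

```python
# Toy model of a PDF page tree: plain dicts stand in for parsed PDF objects.
# Assumption: indirect references in /Kids have already been resolved.

def pages(*kids):
    """Build an intermediate node (/Type /Pages) holding its children."""
    return {"Type": "Pages", "Kids": list(kids)}

def page(label):
    """Build a leaf node (/Type /Page); 'label' is a made-up demo field."""
    return {"Type": "Page", "label": label}

def iter_pages(node):
    """Yield leaf Page objects in document order (depth-first, left to right)."""
    if node["Type"] == "Page":
        yield node
    else:
        for kid in node["Kids"]:
            yield from iter_pages(kid)

def find_page(root, number):
    """Return the leaf for a 1-based page number, or None if out of range."""
    for index, leaf in enumerate(iter_pages(root), start=1):
        if index == number:
            return leaf
    return None

# The same shape as the example tree above: 3 pages, then 2 + 2 pages.
root = pages(
    pages(page("a"), page("b"), page("c")),
    pages(
        pages(page("d"), page("e")),
        pages(page("f"), page("g")),
    ),
)

fifth = find_page(root, 5)  # the leaf marked "Page (5)" in the example
```

Each Pages node in a real file also carries a `/Count` entry with the number of leaves below it, so a more careful implementation could skip whole subtrees instead of visiting every leaf.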

Once you have a Page object, you can use it (and potentially its parents) to find the resources you need for that page. Look up the Resources dictionary in the PDF specification and keep inheritance in mind: a Page may omit certain entries and take them from an ancestor Pages node instead.
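Inheritance can be sketched as a walk up the `/Parent` chain. Again this is a toy model under stated assumptions: dicts stand in for parsed objects, the `Parent` links are assumed to be resolved already, the helper name `inherited` is hypothetical, and the font entry is a simplified placeholder rather than a real font dictionary. The specification lists which page attributes are inheritable; `Resources`, `MediaBox`, `CropBox` and `Rotate` are among them.

```python
def inherited(node, key):
    """Return node[key], falling back to the nearest ancestor that defines it."""
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")  # the root Pages node has no Parent
    return None

# Made-up objects: the root defines defaults, one page overrides MediaBox.
root = {
    "Type": "Pages",
    "MediaBox": [0, 0, 612, 792],
    "Resources": {"Font": {"F1": "Helvetica"}},  # simplified placeholder
}
middle = {"Type": "Pages", "Parent": root}
page_a = {"Type": "Page", "Parent": middle}
page_b = {"Type": "Page", "Parent": middle, "MediaBox": [0, 0, 595, 842]}

box_a = inherited(page_a, "MediaBox")  # taken from the root
box_b = inherited(page_b, "MediaBox")  # the page's own value wins
```

When extracting a single page into a new file, copying the inherited values onto the page itself is one way to keep it self-contained once its original parents are gone.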
