Extracting stream from pdf in python

Question

How can I extract part of this stream (the one called BLABLABLA) from the pdf file that contains it?

<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources<</ColorSpace<</CS0 563 0 R>>/ExtGState<</GS0 568 0 R>>/Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>/ProcSet[/PDF/Text/ImageC]/Properties<</MC0<</BLABLABLA 584 0 R>>/MC1<</SubKey 582 0 R>>>>/XObject<</Im0 578 0 R>>>>/Rotate 0/StructParents 0/Type/Page>>

Or, in other words, how can I extract a subsection from a PDF stream?

I would like to use a Python library (such as pyPdf or ReportLab), but a C/C++ library would also work for me.

Can anyone help me?

+1

python stream pdf pypdf reportlab

Giancarlo 09 jan. 09 at 19:47

3 answers

There is a Python text extraction tool on Google Code called PDFMiner. I don't know if it will do what you want, but it might be worth a look.

+3

Ferruccio 10 jan. 09 at 12:43

I haven't used it myself, but the gfx module in SWFTools might help you.

0

user49117 09 jan. 09:41 pm

Tony Meyer · Accepted Answer · 2009-01-11T22:06:59+0000

IIUC, a PDF stream is just a sequence of binary data, so I think what you want is to extract part of an object. Is it a standard kind of object, such as an image or text? It would be much easier to give you sample code if there were a real example.

This may help you get started:

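import pyPdf

# Open in binary mode; a PDF is not a text file.
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages)  # Load every page so the objects they reference get resolved.
print(pdf.resolvedObjects)  # Cache of the indirect objects resolved so far.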
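
If no PDF library is at hand, the lookup itself can be sketched with the standard library alone. In the page dictionary from the question, /BLABLABLA 584 0 R is an indirect reference to object 584, generation 0, so the data lives between the "stream" and "endstream" keywords of the "584 0 obj" entry. The helper names find_reference and extract_stream below are made up for this illustration, and the sketch rests on several assumptions: the object is stored directly in the file (not packed inside a compressed object stream), the file is not encrypted, and the only filter in use, if any, is /FlateDecode. It scans raw bytes with regular expressions instead of following the cross-reference table, so it can be fooled by binary data that happens to look like PDF syntax.

```python
import re
import zlib


def find_reference(pdf_bytes, name):
    """Return (object number, generation) for the first '/name N G R', or None."""
    pattern = rb"/" + re.escape(name) + rb"\s+(\d+)\s+(\d+)\s+R\b"
    match = re.search(pattern, pdf_bytes)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def extract_stream(pdf_bytes, obj_num, gen_num=0):
    """Return the stream data of object 'obj_num gen_num', or None if absent."""
    # Find the "N G obj" header by scanning the raw bytes (naive, see lead-in).
    header = re.search(rb"(?<!\d)%d\s+%d\s+obj\b" % (obj_num, gen_num), pdf_bytes)
    if header is None:
        return None
    keyword = pdf_bytes.find(b"stream", header.end())
    endobj = pdf_bytes.find(b"endobj", header.end())
    if keyword == -1 or (endobj != -1 and endobj < keyword):
        return None  # the object ends before any stream keyword: no stream
    dictionary = pdf_bytes[header.end():keyword]

    # The data starts after the end-of-line marker that follows "stream".
    start = keyword + len(b"stream")
    if pdf_bytes[start:start + 2] == b"\r\n":
        start += 2
    elif pdf_bytes[start:start + 1] in (b"\n", b"\r"):
        start += 1
    end = pdf_bytes.find(b"endstream", start)
    if end == -1:
        return None
    raw = pdf_bytes[start:end]

    if b"/FlateDecode" in dictionary:
        # decompressobj() tolerates the end-of-line bytes left after the data.
        return zlib.decompressobj().decompress(raw)
    # Unfiltered data: drop the single end-of-line before "endstream".
    if raw.endswith(b"\r\n"):
        return raw[:-2]
    if raw.endswith((b"\n", b"\r")):
        return raw[:-1]
    return raw
```

To try it, read the file in binary mode with open("pdffile.pdf", "rb").read(), call find_reference(data, b"BLABLABLA") to get the object number, and pass that number to extract_stream. For files that use object streams, encryption, or other filters, a real parser such as pyPdf or PDFMiner is the safer route.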
