Extract only text from PDF files using CGPDFScanner

Question

There are several questions (some answered, others not) about extracting plain text from PDF files. Stack Overflow was helpful in pointing out that the Adobe documentation is very useful for identifying objects during parsing: for example, when using CGPDFScanner, you can register callbacks for the "BT" and "ET" PDF operators.

Apple's documentation shows an example callback:

static void op_BT (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BT /%s\n", name);   
}

Alongside the other CGPDFScanner calls, the callback above is registered by first creating an operator table:

myTable = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback (myTable, "BT", &op_BT);

So far so good, but Apple's documentation doesn't seem to help beginner-to-intermediate programmers like me figure out the next step: besides delimiting a text block (presumably between the BT and ET callbacks?), what are the few steps or lines of code, inside or outside the callbacks, needed to capture the identified text block in an NSString?

Many thanks.

stream text objective-c pdf file-format

MikeLondonUK May 12 '15 @ 9:16 am

1 answer

David van Driessche · Answer 1 · 2015-05-12T10:24:27+0000

The first thing you need to do is download the PDF reference. It is an ISO standard these days, but you can download the Acrobat SDK ( http://www.adobe.com/devnet/acrobat.html ), which contains a copy of the Adobe version that will serve you just as well.

Read Chapter 9. It will teach you that, on the one hand, you need to understand the text operators (Tj, ', ", TJ) and, on the other hand, you need to understand fonts and encodings.

The text operators are the ones you want to intercept, because they are what add "strings" to a PDF document. While all text operators should appear between BT and ET, intercepting BT and ET by themselves isn't going to do much for you, I guess.
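
To illustrate what intercepting a text operator involves, here is a minimal sketch of handling the operand of TJ, which is an array mixing strings and numeric position adjustments (in thousandths of a text-space unit). With CGPDFScanner the array would be obtained with CGPDFScannerPopArray; here it is modeled by a hypothetical TJElement struct so the sketch stays self-contained. The collect_tj name and the SPACE_THRESHOLD value are assumptions made for illustration, and the collected bytes are still raw character codes, not necessarily readable text.

```c
#include <stddef.h>

/* Hypothetical model of one element of a TJ operand array:
   either a string of character codes or a numeric adjustment. */
typedef struct {
    const char *bytes;   /* non-NULL for a string element */
    size_t      length;  /* number of bytes in the string */
    double      adjust;  /* used when bytes == NULL; thousandths of a text-space unit */
} TJElement;

/* Assumed heuristic: an adjustment below this value opens a gap
   wide enough to be treated as a word space. */
#define SPACE_THRESHOLD (-200.0)

/* Concatenates the string elements of a TJ array into out (NUL-terminated).
   Returns the number of bytes written, excluding the terminator. */
size_t collect_tj(const TJElement *items, size_t count,
                  char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < count; i++) {
        if (items[i].bytes != NULL) {
            /* String element: copy its raw character codes. */
            for (size_t j = 0; j < items[i].length && used + 1 < out_size; j++)
                out[used++] = items[i].bytes[j];
        } else if (items[i].adjust < SPACE_THRESHOLD && used + 1 < out_size) {
            /* Large negative adjustment: the next glyph moves right, leaving a gap. */
            out[used++] = ' ';
        }
    }
    out[used] = '\0';
    return used;
}
```

Small adjustments are usually just kerning and are ignored here; whether a gap counts as a space depends on the font and size, so the threshold would need tuning for real files.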

Fonts are important because they determine how the bytes used by these operators correspond to actual (Unicode) characters. So if you want to get meaning out of the bytes you receive from the PDF file, you need to know how to use the fonts to get that meaning.
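
As a rough sketch of that byte-to-character step, the code below maps single-byte character codes through a per-font lookup table and emits UTF-8. The FontMap struct and the decode_text function are hypothetical names introduced here; in a real extractor the table would have to be built from the font's encoding or ToUnicode information, and composite fonts use multi-byte codes that this simplified version does not handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-font table: index = character code from the content
   stream, value = Unicode code point (0 means "no mapping known"). */
typedef struct {
    uint32_t to_unicode[256];
} FontMap;

/* Encodes one code point as UTF-8 into dst (up to 4 bytes); returns the length. */
static size_t utf8_encode(uint32_t cp, char *dst)
{
    if (cp < 0x80) {
        dst[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (char)(0xC0 | (cp >> 6));
        dst[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (char)(0xE0 | (cp >> 12));
        dst[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (char)(0xF0 | (cp >> 18));
    dst[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decodes single-byte codes into a NUL-terminated UTF-8 string.
   Unmapped codes become U+FFFD. Returns bytes written, excluding the terminator. */
size_t decode_text(const FontMap *font, const unsigned char *codes, size_t count,
                   char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < count; i++) {
        uint32_t cp = font->to_unicode[codes[i]];
        char tmp[4];
        size_t len;

        if (cp == 0)
            cp = 0xFFFD;               /* replacement character */
        len = utf8_encode(cp, tmp);
        if (used + len + 1 > out_size) /* keep room for the terminator */
            break;
        memcpy(out + used, tmp, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

The point of the sketch is only that the same byte can mean different characters in different fonts, so the table has to be selected according to the font that is current when the text operator is seen.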

Some additional points:

Don't assume that BT and ET delimit an actual text block or paragraph as you would know it from an application like InDesign or Word. One BT/ET block can contain an entire page or a single character (or nothing).
There are also text state operators that control how text is displayed on the page. There are ways, for example, to draw invisible text; you may or may not want to extract this type of text. If you don't, you will need to keep track of a fair amount of text state so that you can tell the difference.
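
One piece of that state is the text rendering mode set by the Tr operator, where mode 3 means the text is neither filled nor stroked. The sketch below tracks only that value across the q and Q save/restore operators. The TextState type and the ts_ functions are hypothetical names, the stack depth limit is an arbitrary assumption, and text can be hidden in other ways that this sketch does not detect.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 32   /* assumed limit on nested q/Q pairs */

/* Minimal text-state tracker: one rendering mode per graphics-state level. */
typedef struct {
    int    render_mode[MAX_DEPTH];
    size_t depth;
} TextState;

/* Start of a page: rendering mode defaults to 0 (fill). */
void ts_init(TextState *ts)
{
    ts->depth = 0;
    ts->render_mode[0] = 0;
}

/* Call from the Tr callback with the popped integer operand. */
void ts_set_render_mode(TextState *ts, int mode)
{
    ts->render_mode[ts->depth] = mode;
}

/* Call from the q callback: the new level inherits the current mode. */
bool ts_save(TextState *ts)
{
    if (ts->depth + 1 >= MAX_DEPTH)
        return false;
    ts->render_mode[ts->depth + 1] = ts->render_mode[ts->depth];
    ts->depth++;
    return true;
}

/* Call from the Q callback: go back to the previously saved mode. */
bool ts_restore(TextState *ts)
{
    if (ts->depth == 0)
        return false;
    ts->depth--;
    return true;
}

/* Mode 3 paints nothing, so text shown in that mode is invisible. */
bool ts_text_is_visible(const TextState *ts)
{
    return ts->render_mode[ts->depth] != 3;
}
```

A Tj or TJ callback could then consult ts_text_is_visible before keeping a string.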

Not a small task :)

Update after viewing a sample PDF

Since the question has been clarified in the comments to be about extracting text from a specific type of PDF file, let me add a little more information.

1) If you look at the PDF you link to, you cannot ignore the font / encoding issue. The fonts in the sample PDF file are subsetted, which means you don't have "cleartext" in the PDF page description, but instead indexes that must be translated through the encoding of the fonts used to get meaningful text.

2) Extracting the text is possible, as you can see from the following output from pdfToolbox (warning, I'm affiliated with this tool):

<page id="33">
    <words>
        <word txt="Senator">
            <parts>
                <part tlh="28.3481" tlv="868.534" trh="55.4455" trv="868.534" blh="28.3481" blv="859.902" brh="55.4455" brv="859.902"></part>
            </parts>
        </word>
        <word txt="House,">
            <parts>
                <part tlh="57.5305" tlv="868.534" trh="82.123" trv="868.534" blh="57.5305" blv="859.902" brh="82.123" brv="859.902"></part>
            </parts>
        </word>
        <word txt="85">
            <parts>
                <part tlh="84.208" tlv="868.534" trh="92.548" trv="868.534" blh="84.208" blv="859.902" brh="92.548" brv="859.902"></part>
            </parts>
        </word>

There are undoubtedly other tools that can give a similar (or better) result, so text extraction in itself should be doable.

The big problem is getting the text you are interested in, in the correct order. The option I used here gives the text of each word and its position (bounding box) on the page. Looking through the XML, once you get to the table, the problem becomes working out which text belongs to which table cell, where rows and columns end, and so on.

In a way, this problem is more complicated than the problem of simply identifying lines of text, because you are dealing with a fairly dense table: where that problem is mostly one-dimensional (putting everything on one line), this problem is two-dimensional.
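
To make the ordering problem concrete, here is a minimal sketch that groups words into rows by their top coordinate and then orders each row from left to right. It covers only the row half of the problem; finding column boundaries would need a similar pass over the horizontal coordinates. The Word struct, the order_words function, and the tolerance parameter are assumptions made for illustration, and the sketch assumes the vertical coordinate grows upward, as the tlv and blv values in the output above suggest.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one extracted word. */
typedef struct {
    const char *text;
    double      left;  /* left edge of the bounding box */
    double      top;   /* top edge of the bounding box */
    int         row;   /* filled in by order_words */
} Word;

/* Larger top values are higher on the page, so sort them first. */
static int cmp_top_desc(const void *a, const void *b)
{
    const Word *wa = a, *wb = b;
    if (wa->top > wb->top) return -1;
    if (wa->top < wb->top) return 1;
    return 0;
}

/* Reading order: by row, then left to right within a row. */
static int cmp_row_then_left(const void *a, const void *b)
{
    const Word *wa = a, *wb = b;
    if (wa->row != wb->row) return wa->row < wb->row ? -1 : 1;
    if (wa->left < wb->left) return -1;
    if (wa->left > wb->left) return 1;
    return 0;
}

/* Assigns a row index to every word and sorts the array into reading order.
   A new row starts when a word's top lies more than `tolerance` below the
   top of the first word in the current row. Returns the number of rows. */
size_t order_words(Word *words, size_t count, double tolerance)
{
    int row = 0;
    double row_top;

    if (count == 0)
        return 0;

    qsort(words, count, sizeof *words, cmp_top_desc);

    row_top = words[0].top;
    for (size_t i = 0; i < count; i++) {
        if (row_top - words[i].top > tolerance) {
            row++;
            row_top = words[i].top;
        }
        words[i].row = row;
    }

    qsort(words, count, sizeof *words, cmp_row_then_left);
    return (size_t)row + 1;
}
```

Sorting by the top coordinate first and assigning rows in a separate pass avoids putting the tolerance into the qsort comparator, where it would make the ordering inconsistent.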