Java: Apache PDFbox Extract selected text

Question


I am using the Apache PDFBox library to extract highlighted text (i.e. text with a yellow background) from a PDF file. I am completely new to this library and do not know which of its classes can be used for this purpose. So far I have been extracting text from comments using the code below.

PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("\nProcess Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0);
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());
}
pddDocument.close();

Now I need to get the highlighted text itself. Any code example would be much appreciated.

+3

java pdf pdfbox

Abid khan 21 oct. At 7:51 am


3 answers

Hope this answer helps anyone else who is facing the same problem.

// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // this is the in-memory representation of the PDF document.
    // this will load a document from a file.
    PDDocument document = PDDocument.load(filePath);
    // this represents all pages in a PDF document.
    List<PDPage> allPages =  document.getDocumentCatalog().getAllPages();
    // this represents a single page in a PDF document;
    // pageNumber is used as a zero-based index into the page list.
    PDPage page = allPages.get(pageNumber);
    // get  annotation dictionaries
    List<PDAnnotation> annotations = page.getAnnotations();

    for(int i=0; i<annotations.size(); i++) {
        // check subType 
        if("Highlight".equals(annotations.get(i).getSubtype())) {
            // extract highlighted text
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {

                // QuadPoints entries may be stored as integers or reals,
                // so cast to the common base class COSNumber.
                COSNumber ULX = (COSNumber) quadsArray.get(0+k);
                COSNumber ULY = (COSNumber) quadsArray.get(1+k);
                COSNumber URX = (COSNumber) quadsArray.get(2+k);
                COSNumber URY = (COSNumber) quadsArray.get(3+k);
                COSNumber LLX = (COSNumber) quadsArray.get(4+k);
                COSNumber LLY = (COSNumber) quadsArray.get(5+k);
                COSNumber LRX = (COSNumber) quadsArray.get(6+k);
                COSNumber LRY = (COSNumber) quadsArray.get(7+k);

                k+=8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // calculated by upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}

+6

Roham amini 13 Aug 16 at 5:19 am


COSArray quadsArray =  ((COSArray) annotations.get(i)).getCOSObject().getDictionaryObject(COSName.getPDFName("Rosen Capital Advisors Fact Sheet (2019_02).pdf"));

I am getting the error "method getDictionaryObject(COSName) not defined for type COSBase".

Can someone please give me a solution? I am using PDFBox 2.0.16.

0

Yogesh Mallik 03 jul. At 10:16


mkl · Accepted Answer · 2015-10-22T10:05:42+0000

The code in the question "Can't read the exact line-by-line text" already illustrates most of the concepts needed to extract text from a restricted area of a page with PDFBox.

After examining this code, the OP was still wondering in the comment:

But I am confused by one thing: QuadPoints instead of Rect, as you mentioned there in the comment. What is it? Can you explain it with some lines of code or in simple words, as I am also facing the same problem of multi-line strings?

In general, the area to which the annotation belongs is a rectangle:

Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.

(from Table 164 - Entries Common to All Annotation Dictionaries - in ISO 32000-1)

For some types of annotations (e.g. text markup) this location value is not sufficient because:

markup text may be written at some odd angle, but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
markup text may start anywhere on one line and end anywhere on another, so the markup area is not a rectangle but a union of several rectangular parts.

Thus, to deal with these types of annotations, the PDF specification provides a more general way of defining areas:

QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompass a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order

x₁ y₁ x₂ y₂ x₃ y₃ x₄ y₄

specifying the quadrilateral's four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x₁, y₁) and (x₂, y₂).

(from table 179 - additional entries specific to text markup annotations - in ISO 32000-1)
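To make the layout of that array concrete, here is a minimal sketch in plain Java that cuts a flat QuadPoints-style array of 8 × n numbers into n groups of eight. The class name QuadPointsSplitter and its method split are made up for illustration and are not part of PDFBox; reading the numbers out of the annotation dictionary is assumed to have happened already.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): splits a flat QuadPoints array into quadrilaterals. */
final class QuadPointsSplitter {

    /** Each quadrilateral is stored as four (x, y) vertices, i.e. eight numbers. */
    static final int VALUES_PER_QUAD = 8;

    /** Returns one float[8] per quadrilateral, in the order they appear in the input. */
    static List<float[]> split(float[] quadPoints) {
        if (quadPoints.length % VALUES_PER_QUAD != 0) {
            throw new IllegalArgumentException(
                    "QuadPoints length must be a multiple of 8, got " + quadPoints.length);
        }
        List<float[]> quads = new ArrayList<>();
        for (int offset = 0; offset < quadPoints.length; offset += VALUES_PER_QUAD) {
            float[] quad = new float[VALUES_PER_QUAD];
            System.arraycopy(quadPoints, offset, quad, 0, VALUES_PER_QUAD);
            quads.add(quad);
        }
        return quads;
    }
}
```

A highlight spanning two lines would typically come back as two such groups, one per line of marked text.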

Thus, instead of the rectangle given by

PDRectangle rect = pdfAnnot.getRectangle();

in the code of the referenced question, you have to consider the quadrilaterals given by

COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));

and define the regions for the PDFTextStripperByArea stripper accordingly. Unfortunately, PDFTextStripperByArea.addRegion expects a rectangle as a parameter, not a general quadrilateral. Since text is usually printed horizontally or vertically, this shouldn't be too much of a problem.
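Because addRegion wants a rectangle, one simple option is to use the bounding box of each quadrilateral. The following is only a sketch under assumptions: QuadBounds and toRegion are hypothetical names, the page is assumed to be unrotated with its media box origin at (0, 0), and the y coordinate is flipped against the page height in the same way the code in the earlier answer on this page does. Taking minima and maxima also means the result does not depend on the order in which the four vertices are stored.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not a PDFBox class): bounding rectangle of one QuadPoints quadrilateral. */
final class QuadBounds {

    /**
     * @param quad       eight numbers x1 y1 x2 y2 x3 y3 x4 y4 in PDF user space (y grows upwards)
     * @param pageHeight page height in user space units; assumes an unrotated page with origin (0, 0)
     * @return axis-parallel rectangle with a top-left origin (y grows downwards)
     */
    static Rectangle2D.Float toRegion(float[] quad, float pageHeight) {
        if (quad.length != 8) {
            throw new IllegalArgumentException("a quad needs exactly 8 values, got " + quad.length);
        }
        float minX = quad[0], maxX = quad[0];
        float minY = quad[1], maxY = quad[1];
        for (int i = 2; i < quad.length; i += 2) {
            minX = Math.min(minX, quad[i]);
            maxX = Math.max(maxX, quad[i]);
            minY = Math.min(minY, quad[i + 1]);
            maxY = Math.max(maxY, quad[i + 1]);
        }
        // Flip the y axis: the top edge of the box (maxY) becomes the rectangle's y.
        return new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY);
    }
}
```

For text printed at an odd angle the bounding box is larger than the marked area, so neighbouring text may be picked up as well.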

PS: One word of warning about the QuadPoints specification: the order of the vertices may differ in actual PDF files, cf. the question "PDF Spec vs Acrobat Conditions creation (QuadPoints)".
