Unable to read exact line-separated text

Question

I am reading highlighted text from a PDF document using PDFBox. I can read a highlight that sits on a single line, whether it covers one word or several. However, I cannot read exactly the highlighted text when the highlight spans more than one line. Here is the code I use to read the highlighted text.

PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
            int pageNum = i + 1;
            PDPage page = (PDPage) allPages.get(i);
            List<PDAnnotation> la = page.getAnnotations();
            if (la.size() < 1) {
                continue;
            }
            System.out.println("Page number : "+pageNum);
            for (PDAnnotation pdfAnnot: la) {
                if (pdfAnnot.getSubtype().equals("Popup")) {
                    continue;
                }

                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);

                PDRectangle rect = pdfAnnot.getRectangle();
                float x = rect.getLowerLeftX() - 1;
                float y = rect.getUpperRightY() - 1;
                float width = rect.getWidth();
                float height = rect.getHeight() + rect.getHeight() / 4;

                int rotation = page.findRotation();
                if (rotation == 0) {
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                }

                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion(Integer.toString(0), awtRect);
                stripper.extractRegions(page);
                System.out.println("------------------------------------------------------------------");
                System.out.println("Annot type = " + pdfAnnot.getSubtype());
                System.out.println("Getting text from region = " + stripper.getTextForRegion(Integer.toString(0)) + "\n");
                System.out.println("Getting text from comment = " + pdfAnnot.getContents());

            }
        }

When the highlighted text spans several lines, pdfAnnot.getRectangle() returns the smallest rectangle that encloses the whole selection. Extracting from that rectangle yields more text than was highlighted. I have not been able to find an API that returns exactly the selected text.

For example, here is the text of a test PDF file:

Anyone, anywhere, can open a PDF file. All you need is free Adobe Acrobat

Reader. Recipients of other file formats sometimes cannot open files because they

do not have applications used to create documents.

Use case 1: Read the first highlighted text, i.e. "PDF". There is no problem when the selection sits on a single line; the correct text is printed:
Output: Getting text from region = "PDF"

Use case 2: Read the second highlighted text, i.e. "Adobe Acrobat Reader", which spans two lines. In this case, running the program above extracts:
Output: Getting text from region = "Anyone, anywhere, can open a PDF file. All you need is free Adobe Acrobat Reader. Recipients of other file formats sometimes cannot open files because they do not have applications used to create documents."

getRectangle() returns the coordinates of the smallest rectangle that encloses the selected text, so the extracted region contains much more than just "Adobe Acrobat Reader".
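The geometry behind this can be illustrated without PDFBox. The sketch below uses made-up coordinates (top-left origin, as PDFTextStripperByArea expects) for the end of one line and the start of the next; the class name BoundingBoxDemo and all numbers are hypothetical. It only shows that the union of the two pieces also covers parts of the page that were never highlighted.

```java
import java.awt.geom.Rectangle2D;

public class BoundingBoxDemo {

    /** Smallest rectangle enclosing both pieces, analogous to what the annotation rectangle describes. */
    public static Rectangle2D boundingBox(Rectangle2D a, Rectangle2D b) {
        return a.createUnion(b);
    }

    public static void main(String[] args) {
        // Hypothetical highlight: tail of line 1 and head of line 2.
        Rectangle2D line1End = new Rectangle2D.Float(400f, 100f, 140f, 12f);
        Rectangle2D line2Start = new Rectangle2D.Float(72f, 114f, 40f, 12f);
        Rectangle2D box = boundingBox(line1End, line2Start);

        // A point at the start of line 1, which is not highlighted.
        double px = 100, py = 105;
        System.out.println("in line1End:   " + line1End.contains(px, py));
        System.out.println("in line2Start: " + line2Start.contains(px, py));
        System.out.println("in bounding box: " + box.contains(px, py));
    }
}
```

Any text located at such a point is picked up when the whole bounding box is used as the extraction region.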

How can I find the start and end points of the selection within the annotation area?
How can I find out the number of lines in the extracted area?

Any help would be much appreciated.

java pdf pdfbox text-extraction

user5342176 Sep 16 15 at 12:03

1 answer

Roham amini · Answer 1 · 2016-08-13T05:11:41+0000

I was able to extract the selected text using the following code.

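// Imports for PDFBox 1.8.x (the API generation that provides getAllPages()).
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.*;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

// PDF32000-2008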
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // this is the in-memory representation of the PDF document.
    // this will load a document from a file.
    PDDocument document = PDDocument.load(filePath);
    // this represents all pages in a PDF document.
    List<PDPage> allPages =  document.getDocumentCatalog().getAllPages();
    // this represents a single page in a PDF document.
    PDPage page = allPages.get(pageNumber);
    // get  annotation dictionaries
    List<PDAnnotation> annotations = page.getAnnotations();

    for(int i=0; i<annotations.size(); i++) {
        // check subType 
        if(annotations.get(i).getSubtype().equals("Highlight")) {
            // extract highlighted text
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {

                // QuadPoints entries may be stored as integers or reals,
                // so cast to COSNumber rather than COSFloat.
                COSNumber ULX = (COSNumber) quadsArray.get(0+k);
                COSNumber ULY = (COSNumber) quadsArray.get(1+k);
                COSNumber URX = (COSNumber) quadsArray.get(2+k);
                COSNumber URY = (COSNumber) quadsArray.get(3+k);
                COSNumber LLX = (COSNumber) quadsArray.get(4+k);
                COSNumber LLY = (COSNumber) quadsArray.get(5+k);
                COSNumber LRX = (COSNumber) quadsArray.get(6+k);
                COSNumber LRY = (COSNumber) quadsArray.get(7+k);

                k+=8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // calculated by upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}
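
The coordinate handling in the method above can also be isolated from PDFBox. The following is a minimal sketch, not part of the original answer: the class QuadRegions and the method toRegions are hypothetical names, and the input is assumed to be the QuadPoints values already read into a float array together with the page height. It takes the minimum and maximum of each quad's four corners, so it does not depend on the order in which a viewer wrote the corners, and it flips the y axis to the top-left origin that PDFTextStripperByArea regions use.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class QuadRegions {

    /**
     * Converts a flat QuadPoints array (8 numbers per quadrilateral, PDF user
     * space with the origin at the bottom-left) into one rectangle per quad,
     * expressed with a top-left origin. Trailing incomplete quads are ignored.
     */
    public static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int base = 0; base + 8 <= quadPoints.length; base += 8) {
            float minX = Float.POSITIVE_INFINITY, maxX = Float.NEGATIVE_INFINITY;
            float minY = Float.POSITIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
            // Scan the four (x, y) corners of this quad.
            for (int p = 0; p < 8; p += 2) {
                float x = quadPoints[base + p];
                float y = quadPoints[base + p + 1];
                minX = Math.min(minX, x);
                maxX = Math.max(maxX, x);
                minY = Math.min(minY, y);
                maxY = Math.max(maxY, y);
            }
            // The top edge in PDF space (maxY) becomes the rectangle's y after the flip.
            regions.add(new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Made-up highlight spanning two lines on a page 792 units high.
        float[] quads = {
            300, 700, 500, 700, 300, 688, 500, 688,
            72, 686, 110, 686, 72, 674, 110, 674
        };
        for (Rectangle2D.Float r : toRegions(quads, 792f)) {
            System.out.println(r);
        }
    }
}
```

Viewers typically write one quad per highlighted line, so the size of the returned list is usually the number of lines in the selection, although the format does not guarantee this. Each rectangle can then be registered as its own region, under a distinct name, on the stripper.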
