Unable to read exact line-separated text
I am reading highlighted text from a PDF document using PDFBox. I can read a selection that sits on a single line, whether it is one word or several, but I cannot correctly read a selection that spans more than one line. Here is the code I use to read the highlighted text.
PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Page number : " + pageNum);
    for (PDAnnotation pdfAnnot : la) {
        if (pdfAnnot.getSubtype().equals("Popup")) {
            continue;
        }
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        PDRectangle rect = pdfAnnot.getRectangle();
        float x = rect.getLowerLeftX() - 1;
        float y = rect.getUpperRightY() - 1;
        float width = rect.getWidth();
        float height = rect.getHeight() + rect.getHeight() / 4;
        int rotation = page.findRotation();
        if (rotation == 0) {
            PDRectangle pageSize = page.getMediaBox();
            y = pageSize.getHeight() - y;
        }
        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
        stripper.addRegion(Integer.toString(0), awtRect);
        stripper.extractRegions(page);
        System.out.println("------------------------------------------------------------------");
        System.out.println("Annot type = " + pdfAnnot.getSubtype());
        System.out.println("Getting text from region = " + stripper.getTextForRegion(Integer.toString(0)) + "\n");
        System.out.println("Getting text from comment = " + pdfAnnot.getContents());
    }
}
When the selection spans several lines, pdfAnnot.getRectangle() returns the smallest rectangle that encloses all of the selected text, so extracting that region yields more text than was selected. I have not been able to find any API for getting exactly the selected text.
For example, take this text from a test PDF file:
Anyone, anywhere, can open a PDF file . All you need is free Adobe Acrobat
Reader . Recipients of other file formats sometimes cannot open files because they
do not have applications used to create documents.
Use case 1: Read the first bold text, i.e. "PDF". There is no problem when the selected text is on one line; the correct text is printed:
Output: Getting text from region = " PDF "
Use case 2: Read the second bold text, i.e. "Adobe Acrobat Reader", which spans two lines. In this case, running the program above extracts:
Output: Getting text from region = " Anyone, anywhere can open a PDF file. All you need is a free Adobe Acrobat Reader. Recipients of other file formats sometimes cannot open files because what they are . "
getRectangle() returns the coordinates of the smallest rectangle that encloses the selected text, so the output contains more than "Adobe Acrobat Reader".
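The effect can be shown with plain rectangles, without any PDF. The following is a minimal sketch with made-up coordinates; the class name, the method name and all numbers are illustrative assumptions, not values taken from the test file. It joins the two highlighted fragments of a two-line selection into one enclosing rectangle, which is roughly what a single annotation rectangle amounts to.

```java
import java.awt.geom.Rectangle2D;

public class BoundingBoxDemo {

    // Smallest rectangle that encloses both fragments.
    static Rectangle2D enclosing(Rectangle2D a, Rectangle2D b) {
        return a.createUnion(b);
    }

    // Area of a rectangle, used only to compare sizes.
    static double area(Rectangle2D r) {
        return r.getWidth() * r.getHeight();
    }

    public static void main(String[] args) {
        // Hypothetical layout (origin at the top-left of the page):
        // the selection starts near the end of line 1 ...
        Rectangle2D line1Part = new Rectangle2D.Float(400, 100, 150, 12);
        // ... and continues at the start of line 2.
        Rectangle2D line2Part = new Rectangle2D.Float(50, 114, 45, 12);

        Rectangle2D box = enclosing(line1Part, line2Part);

        System.out.println("Enclosing box: x=" + box.getX() + " y=" + box.getY()
                + " w=" + box.getWidth() + " h=" + box.getHeight());
        System.out.println("Selected area: " + (area(line1Part) + area(line2Part)));
        System.out.println("Box area:      " + area(box));
    }
}
```

With these numbers the enclosing box spans the full width of both lines, so any text extraction over it also picks up the unselected words before the first fragment and after the second one.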
- How can I find the start and end points of the selection within the selected area?
- How can I find the number of lines in the extracted area?
Any help would be much appreciated.
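I was able to extract the selected text using the following code. Instead of the annotation's single bounding rectangle, it reads the QuadPoints entry of the highlight annotation, which holds eight numbers per highlighted quadrilateral, and extracts the text of each quadrilateral separately.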
// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // In-memory representation of the PDF document, loaded from a file.
    PDDocument document = PDDocument.load(filePath);
    // All pages in the PDF document.
    List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
    // A single page; pageNumber is a zero-based index into the page list.
    PDPage page = allPages.get(pageNumber);
    // Annotation dictionaries of that page.
    List<PDAnnotation> annotations = page.getAnnotations();
    for (int i = 0; i < annotations.size(); i++) {
        // Only highlight annotations are of interest.
        if (annotations.get(i).getSubtype().equals("Highlight")) {
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
            // QuadPoints holds 8 numbers (4 corner points) per highlighted quadrilateral.
            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;
            for (int j = 1, k = 0; j <= (quadsArray.size() / 8); j++) {
                // Cast to COSNumber so that both COSFloat and COSInteger entries are accepted.
                COSNumber ULX = (COSNumber) quadsArray.get(0 + k);
                COSNumber ULY = (COSNumber) quadsArray.get(1 + k);
                COSNumber URX = (COSNumber) quadsArray.get(2 + k);
                COSNumber URY = (COSNumber) quadsArray.get(3 + k);
                COSNumber LLX = (COSNumber) quadsArray.get(4 + k);
                COSNumber LLY = (COSNumber) quadsArray.get(5 + k);
                COSNumber LRX = (COSNumber) quadsArray.get(6 + k);
                COSNumber LRY = (COSNumber) quadsArray.get(7 + k);
                k += 8;
                float ulx = ULX.floatValue() - 1; // upper left x, widened by one unit
                float uly = ULY.floatValue();     // upper left y
                float width = URX.floatValue() - LLX.floatValue();  // upperRightX - lowerLeftX
                float height = URY.floatValue() - LLY.floatValue(); // upperRightY - lowerLeftY
                // PDF coordinates start at the bottom-left; the stripper expects a top-left origin.
                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;
                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");
                // Join the text of all quadrilaterals of one highlight into a single string.
                if (j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();
    return highlightedTexts;
}
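The coordinate handling in the code above can be tried in isolation, without PDFBox. This is a minimal sketch under stated assumptions: the QuadPoints values are passed as a plain float array, the page height is a made-up 792 units, and the class and method names are hypothetical. It takes the minimum and maximum of each quadrilateral's four corners, so it does not depend on the order in which a viewer writes the corners, and it flips the y axis to the top-left origin used for extraction regions.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class QuadPointsDemo {

    // Converts a flat QuadPoints array (8 numbers per quadrilateral, PDF
    // coordinates with the origin at the bottom-left) into one rectangle
    // per quadrilateral, expressed with the origin at the top-left.
    static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int k = 0; k + 8 <= quadPoints.length; k += 8) {
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Visit the four (x, y) corners of this quadrilateral.
            for (int p = 0; p < 8; p += 2) {
                minX = Math.min(minX, quadPoints[k + p]);
                maxX = Math.max(maxX, quadPoints[k + p]);
                minY = Math.min(minY, quadPoints[k + p + 1]);
                maxY = Math.max(maxY, quadPoints[k + p + 1]);
            }
            // The top edge in PDF space (maxY) becomes the y of the flipped rectangle.
            regions.add(new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Hypothetical highlight that covers the end of one line and the start of the next.
        float[] quads = {
            400, 700, 550, 700, 400, 688, 550, 688,
             50, 686,  95, 686,  50, 674,  95, 674
        };
        List<Rectangle2D.Float> regions = toRegions(quads, 792f);
        System.out.println("Highlighted lines: " + regions.size()); // prints "Highlighted lines: 2"
        for (Rectangle2D.Float r : regions) {
            System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
        }
    }
}
```

Each resulting rectangle could be passed to PDFTextStripperByArea.addRegion under its own region name. The number of rectangles corresponds to the number of highlighted lines only if the viewer that created the annotation wrote one quadrilateral per line, which is common but not guaranteed.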