How to extract highlighted text from PDF using iTextSharp?

Question

According to the following post: iTextSharp PDF Reading highlighted text (highlight annotations) using C#

this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

works for extracting PDF annotations. But why doesn't the same code work for highlights (specifically, PdfName.HIGHLIGHT doesn't work)?

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

+3

.net pdf itextsharp

John stevensons 30 oct. 14 at 11:59

2 answers

Here is a complete example of retrieving the highlighted text using iTextSharp:

    // Requires: using System; using System.IO; using iTextSharp.text;
    // using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public void GetRectAnno()
    {
        string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
        string filePath = appRootDir + "/PDFs/anot.pdf";

        try
        {
            using (PdfReader reader = new PdfReader(filePath))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    PdfDictionary page = reader.GetPageN(i);
                    PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                    if (annots == null)
                        continue; // this page has no annotations

                    foreach (PdfObject annot in annots.ArrayList)
                    {
                        // Resolve the indirect reference to the annotation dictionary
                        PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                        PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);

                        // Only process highlight annotations (subType may be null)
                        if (!PdfName.HIGHLIGHT.Equals(subType))
                            continue;

                        // Get the rectangle and QuadPoints of the highlighted text
                        PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                        Console.WriteLine("Highlight at Rectangle {0} with QuadPoints {1}",
                            coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                        // Rect is stored as [llx lly urx ury]
                        Rectangle rect = new Rectangle(
                            coordinates.GetAsNumber(0).FloatValue,
                            coordinates.GetAsNumber(1).FloatValue,
                            coordinates.GetAsNumber(2).FloatValue,
                            coordinates.GetAsNumber(3).FloatValue);

                        // Extract only the text that falls inside the rectangle
                        RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                        ITextExtractionStrategy strategy =
                            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                        // Show the extracted text on the console
                        Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                    }
                }
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of silently swallowing it
            Console.Error.WriteLine(ex);
        }
    }

+2

Hassan Nazeer 07 jan. 16 at 12:42

Bruno Lowagie · Accepted Answer · 2014-10-30T12:49:28+0000

Please take a look at Table 30 in ISO-32000-1 (the PDF specification). It is titled "Entries in a page object". Among these entries, you can find a key named Annots. Its meaning:

(Optional) An array of annotation dictionaries that must contain indirect references to all annotations associated with the page (see 12.5, Annotations).

You won't find an entry with a key such as Highlight, so it's perfectly normal for the returned array to be null if you have this line:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get annotations like you already did:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

Now you need to loop over this array and look for annotations whose Subtype is equal to Highlight. This type of annotation is listed in ISO-32000-1 Table 169, titled "Annotation types".

In other words, your assumption that the page dictionary contains an entry with the key Highlight is incorrect, and if you read the entire spec, you will also find another false assumption that you made: you mistakenly think that the highlighted text is stored in the Contents entry of the annotation. This shows a lack of understanding of the difference between annotations and page content.

The text you are looking for is stored in the page content stream. The page content stream is independent of the page annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (in its QuadPoints array), and you need to use those coordinates to parse the text that is present in the page content at those coordinates.
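As a rough illustration of the coordinate step, the sketch below turns a flat QuadPoints array (eight numbers per highlighted quadrilateral) into one axis-aligned bounding box per quadrilateral. The names QuadPointsHelper, Box and ToBoxes are hypothetical and not part of iTextSharp; the code is plain C# and assumes the numbers have already been read from the annotation. It is a sketch under those assumptions, not a definitive implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of iTextSharp.
public static class QuadPointsHelper
{
    // Axis-aligned bounding box: lower-left (Llx, Lly), upper-right (Urx, Ury).
    public struct Box
    {
        public float Llx, Lly, Urx, Ury;
    }

    // Splits a flat QuadPoints array (x1 y1 x2 y2 x3 y3 x4 y4 per quadrilateral)
    // into one bounding box per quadrilateral.
    public static List<Box> ToBoxes(float[] quadPoints)
    {
        if (quadPoints == null || quadPoints.Length % 8 != 0)
            throw new ArgumentException("QuadPoints must contain a multiple of 8 numbers.");

        var boxes = new List<Box>();
        for (int i = 0; i < quadPoints.Length; i += 8)
        {
            // Even offsets hold x coordinates, odd offsets hold y coordinates.
            float[] xs = { quadPoints[i], quadPoints[i + 2], quadPoints[i + 4], quadPoints[i + 6] };
            float[] ys = { quadPoints[i + 1], quadPoints[i + 3], quadPoints[i + 5], quadPoints[i + 7] };

            // Min/max makes the result independent of the order of the four vertices.
            boxes.Add(new Box { Llx = xs.Min(), Lly = ys.Min(), Urx = xs.Max(), Ury = ys.Max() });
        }
        return boxes;
    }
}
```

With iTextSharp 5, the input array could presumably be filled from annotation.GetAsArray(PdfName.QUADPOINTS) via GetAsNumber(k).FloatValue, and each box passed as new Rectangle(b.Llx, b.Lly, b.Urx, b.Ury) to a RegionTextRenderFilter. Working per quadrilateral rather than with the single Rect entry can help when a highlight spans several lines, because Rect then also covers text that was not highlighted.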
