PdfTextExtractor.GetTextFromPage suddenly returns an empty string

We've been using the iTextSharp libraries in an SSIS process for several years now to read some values from a set of PDF exams. Everything worked well until this week, when we started getting an empty string back from the PdfTextExtractor.GetTextFromPage method. Here is the code:

    // Read the data from the blob column where the PDF exists
    byte[] byteBuffer = Row.FileData.GetBlobData(0, (int)Row.FileData.Length);

    using (var pdfReader = new PdfReader(byteBuffer))
    {

        // Here is the important stuff
        var extractStrategy = new LocationTextExtractionStrategy();

        // This call will extract the page with the proper data on it depending on the exam type
        // 1-page exams = NBOME - need to read first page for exam result data
        // 2-page exams = NBME - need to read second page for exam result data
        // The next two statements utilize this construct.
        var vendor = pdfReader.NumberOfPages == 1 ? "NBOME" : "NBME";

        // *** THIS NEXT LINE RETURNS THE EMPTY STRING
        var newText = PdfTextExtractor.GetTextFromPage(pdfReader, pdfReader.NumberOfPages == 1 ? 1 : 2, extractStrategy);

        var stringList = newText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);

        var fileParser = FileParseFactory.GetFileParse(stringList, vendor);

        // Populate our output variables
        Row.ParsedExamName = fileParser.GetExamName(stringList);
        Row.DateParsed = DateTime.Now;
        Row.ParsedId = fileParser.GetStudentId(stringList);
        Row.ParsedTestDate = fileParser.GetTestDate(stringList);
        Row.ParsedTestDateString = fileParser.GetTestDateAsString(stringList);
        Row.ParsedName = fileParser.GetStudentName(stringList);
        Row.ParsedTotalScore = fileParser.GetTestScore(stringList);
        Row.ParsedVendor = vendor;
    }

      

This does not happen for all PDF files, by the way. To explain further: we read in exam files. One type of exam (NBME) still reads just fine. The other type (NBOME), however, does not, even though NBOME files had been read without problems until this week.

This leads me to think that something changed internally in the PDF files.

Another bit of information: the pdfReader itself has data (I can get the byte[] array), but any call to extract text just returns blank.

I wish I could share some exam data or files, but this information is sensitive.

Has anyone seen something like this? If so, any possible solutions?

1 answer


Well, we found our answer. The user had originally been going to the NBOME website and downloading the PDF exam result files to import into my parsing system. As I said, this worked for quite a while. Recently (this week), however, she stopped downloading the files directly and instead used the print function to print them to new PDF files. That is when the problem arose.

In the end, it looks like printing a PDF to PDF may have injected some characters or otherwise changed something under the covers, so that reading the file via iTextSharp does not throw an exception but returns an empty string. She should have just continued downloading the files directly.
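Since the extraction fails silently in this case, a small guard can turn the empty result into a visible error. The following is only a sketch: `ExtractionGuard` and `RequireText` are hypothetical names, not part of iTextSharp, and the usage shown in the comment assumes iTextSharp 5.x, where `PdfReader.Info` exposes the document info entries (including `Producer`) as a string dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp: fails loudly when a page
// yields no text instead of letting an empty string flow into the parsers.
public static class ExtractionGuard
{
    // extractedText: the result of PdfTextExtractor.GetTextFromPage
    // pageNumber:    the 1-based page that was read
    // info:          document info entries (assumed: pdfReader.Info); may be null
    public static string RequireText(string extractedText, int pageNumber,
                                     IDictionary<string, string> info)
    {
        if (!string.IsNullOrWhiteSpace(extractedText))
        {
            return extractedText;
        }

        // The Producer entry often names the tool that wrote the file,
        // which can hint that the PDF was re-printed rather than downloaded.
        string producer;
        if (info == null || !info.TryGetValue("Producer", out producer))
        {
            producer = "(unknown)";
        }

        throw new InvalidOperationException(string.Format(
            "No text extracted from page {0}. PDF producer: {1}. " +
            "Was the file printed to PDF instead of downloaded directly?",
            pageNumber, producer));
    }
}

// Possible usage inside the existing using block (assumes iTextSharp 5.x):
//   int page = pdfReader.NumberOfPages == 1 ? 1 : 2;
//   var newText = ExtractionGuard.RequireText(
//       PdfTextExtractor.GetTextFromPage(pdfReader, page, extractStrategy),
//       page, pdfReader.Info);
```

With a check like this, a re-printed file would stop the row with a descriptive message rather than producing blank parsed values.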



Thanks to everyone who offered comments!
