How to extract the contents of a table in a PDF file?

I want to extract the contents of a table in a PDF file, for example:

[image: the table in the PDF]

I wrote this Java program using the iText Java PDF library. It can read the contents of a PDF file line by line, but I don't know how to get the contents of a table.

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PDFReader {

    public static void main(String[] args) {
        System.out.println("Lecteur PDF");
        System.out.println(readPdf("D:/test.pdf"));
    }

    // Returns the text of every page of the PDF, one page after another.
    private static String readPdf(String pdfPath) {
        StringBuilder text = new StringBuilder();
        try {
            PdfReader reader = new PdfReader(pdfPath);
            int n = reader.getNumberOfPages();
            // Pages are numbered from 1 to n inclusive, so the last page needs i <= n.
            for (int i = 1; i <= n; i++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, i));
                text.append('\n');
            }
            reader.close();
        } catch (Exception err) {
            err.printStackTrace();
        }
        return text.toString();
    }
}

      

This is what I get:

[image: output of the program]

But this is not what I want. I want to extract the contents of the table row by row and column by column, for example by storing each row in a Java array.

The first array will contain: "N°", "DATE OBSERVATIONS", "TEXTE"

The second array will contain: "029/14", "Le 1er sept 2014 remune AVURNAV ...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav ..."

The third array will contain: "037/14", "Le 15 October 2014 remune AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."

etc.

Thanks.



1 answer


If your PDF library does not support table extraction, you may need to identify common start and end character sequences for each field and use them to split your data into an array. For example, the first field has the form nnn/nn, the second field ends with nnnn/nn, and the third field ends where the next first field begins.
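The splitting idea above can be sketched with regular expressions. This is only a rough sketch under assumptions that are not confirmed by the question: that every row id (such as 029/14) appears at the start of a line in the extracted text, and that the second field really ends with a four-digit reference of the form nnnn/nn. The class name TableRowSplitter, the method splitRows and the sample text in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRowSplitter {

    // Assumption: a row starts with an id like "029/14" at the beginning of a line.
    private static final Pattern ROW_START = Pattern.compile("(?m)^\\d{3}/\\d{2}\\b");
    // Assumption: the second field ends with a reference like "0123/14".
    private static final Pattern SECOND_END = Pattern.compile("\\b\\d{4}/\\d{2}\\b");

    /** Splits extracted page text into rows of three fields each. */
    static List<String[]> splitRows(String text) {
        // Find the offset of every row id.
        List<Integer> starts = new ArrayList<>();
        Matcher m = ROW_START.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }

        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int from = starts.get(i);
            int to = (i + 1 < starts.size()) ? starts.get(i + 1) : text.length();
            // A row runs from its id up to the next id; collapse line breaks.
            String row = text.substring(from, to).trim().replaceAll("\\s+", " ");

            String id = row.substring(0, 6);          // "nnn/nn" is 6 characters
            String rest = row.substring(6).trim();

            String second = rest;
            String third = "";
            Matcher end = SECOND_END.matcher(rest);
            if (end.find()) {
                second = rest.substring(0, end.end()).trim();
                third = rest.substring(end.end()).trim();
            }
            rows.add(new String[] { id, second, third });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample text, not the real content of the PDF.
        String text = "029/14 Le 1er sept 2014\nAVURNAV 0123/14 SETE travaux\n"
                + "037/14 Le 15 oct 2014 AVURNAV 0456/14 SETE suite";
        for (String[] row : splitRows(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

If the header row is needed as well, it would have to be handled separately, since it does not start with an id.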



This is a tricky issue. I've had to use coordinate-based approaches to deal with this before, but your PDF library may not support extracting letter positions as well as the actual text.
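A coordinate-based approach might look roughly like the sketch below. It assumes you can already obtain each piece of text together with its x and y position on the page (in iText 5 this should be possible through a RenderListener, which receives TextRenderInfo objects). The Chunk class, the column boundaries and the tolerance are hypothetical, and the sketch also assumes that the first column only has text on the first line of each table row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CoordinateTable {

    /** Hypothetical holder for one piece of text and its position on the page. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks into text lines by y, assigns each chunk to a column by x,
     * and starts a new table row whenever a line has text in the first column.
     * columnStarts holds the left x boundary of each column, in ascending order.
     */
    static List<String[]> toRows(List<Chunk> chunks, float[] columnStarts, float tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so the top of the page comes first.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        List<String[]> rows = new ArrayList<>();
        List<Chunk> line = new ArrayList<>();
        for (Chunk c : sorted) {
            if (!line.isEmpty() && Math.abs(line.get(0).y - c.y) > tolerance) {
                addLine(rows, line, columnStarts);
                line = new ArrayList<>();
            }
            line.add(c);
        }
        if (!line.isEmpty()) {
            addLine(rows, line, columnStarts);
        }
        return rows;
    }

    private static void addLine(List<String[]> rows, List<Chunk> line, float[] columnStarts) {
        line.sort((a, b) -> Float.compare(a.x, b.x));   // left to right
        String[] cells = new String[columnStarts.length];
        Arrays.fill(cells, "");
        for (Chunk c : line) {
            int col = columnOf(c.x, columnStarts);
            cells[col] = join(cells[col], c.text);
        }
        if (!cells[0].isEmpty() || rows.isEmpty()) {
            rows.add(cells);                            // a new table row begins
        } else {
            String[] last = rows.get(rows.size() - 1);  // continuation of the previous row
            for (int i = 0; i < cells.length; i++) {
                last[i] = join(last[i], cells[i]);
            }
        }
    }

    // Index of the last column whose left boundary is at or before x.
    private static int columnOf(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                col = i;
            }
        }
        return col;
    }

    private static String join(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a + " " + b;
    }
}
```

The column boundaries would have to be measured from the actual PDF, for instance from the x positions of the header cells.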
