How to extract the contents of a table from a PDF file?

Question


I want to extract the contents of a table from a PDF file, for example:

[image: example table in the PDF]

I wrote this Java program using the iText Java PDF library. It can read the contents of a PDF file page by page, but I don't know how to get the contents of a table.

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PDFReader {

    public static void main(String[] args) {
        System.out.println("Lecteur PDF");
        System.out.println(readPdf("D:/test.pdf"));
    }

    // Returns the text of every page of the PDF, one page after another.
    private static String readPdf(String pdfPath) {
        StringBuilder text = new StringBuilder();
        try {
            PdfReader reader = new PdfReader(pdfPath);
            int n = reader.getNumberOfPages();
            // Pages are numbered from 1, so the last page is n (inclusive).
            for (int i = 1; i <= n; i++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, i));
                text.append('\n');
            }
            reader.close();
        } catch (Exception err) {
            err.printStackTrace();
        }
        return text.toString();
    }
}

This is what I get:

[image: text output of the program]

But this is not what I want. I want to extract the contents of the table row by row and column by column, for example by storing each row in a Java array:

The first array will contain: "N °", "DATE OBERVATIONS", "TEXTE"

The second array will contain: "029/14", "Le 1er sept 2014 remune AVURNAV ...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav ..."

The third array will contain: "037/14", "Le 15 October 2014 remune AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."

etc.

Thanks.


java pdf itext text-extraction

Bertrand 09 jul. '15 at 22:00


1 answer

3-14159265358979323846264 · Answer 1 · 2015-07-09T22:08:12+0000

If your PDF library does not support table extraction, you may need to identify character sequences that commonly start or end a field and use them to split the extracted text into an array. For example, the first field has the form nnn/nn, the second field ends with nnnn/nn, and the third field ends where the next first field begins.
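The splitting described above could be sketched with regular expressions as follows. This is only an illustration under assumptions: the class name RowSplitter, the two patterns, and the sample string are made up for the example, and the patterns would have to be adapted to what the real extracted text looks like. It also assumes each row identifier appears at the start of a line in the extracted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowSplitter {
    // Assumption: every table row starts a line with an identifier like "029/14".
    private static final Pattern ROW_START =
            Pattern.compile("(?m)^(\\d{3}/\\d{2})\\s+");
    // Assumption: the second field ends with a reference of the form nnnn/nn.
    private static final Pattern SECOND_FIELD_END =
            Pattern.compile("\\b\\d{4}/\\d{2}\\b");

    // Splits the extracted text into rows of three fields each.
    static List<String[]> splitRows(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW_START.matcher(text);
        String id = null;
        int bodyStart = 0;
        while (m.find()) {
            if (id != null) {
                // The previous row ends where this identifier begins.
                rows.add(toRow(id, text.substring(bodyStart, m.start())));
            }
            id = m.group(1);
            bodyStart = m.end();
        }
        if (id != null) {
            rows.add(toRow(id, text.substring(bodyStart)));
        }
        return rows;
    }

    // Cuts the rest of a row into the second and third fields.
    private static String[] toRow(String id, String body) {
        // Collapse line breaks inside a cell into single spaces.
        String rest = body.replaceAll("\\s+", " ").trim();
        Matcher end = SECOND_FIELD_END.matcher(rest);
        if (end.find()) {
            return new String[] {
                id,
                rest.substring(0, end.end()).trim(),
                rest.substring(end.end()).trim()
            };
        }
        // No end marker found: keep everything in the second field.
        return new String[] {id, rest, ""};
    }

    public static void main(String[] args) {
        // Made-up sample text, not taken from the PDF in the question.
        String sample = "001/14 note A 1234/14 body A\nmore A\n"
                + "002/14 note B 5678/14 body B";
        for (String[] row : splitRows(sample)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The weak point of this approach is that any text inside a cell that happens to match one of the patterns will split the row in the wrong place, so it only works when the markers are reliable.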

This is a tricky issue. I've had to use coordinate-based approaches to deal with it before, but your PDF library may not support extracting the positions of the letters along with the actual text.
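A coordinate-based approach could look roughly like the sketch below. It does not call any PDF library: it assumes you already have a list of text chunks with the x/y position of each one. As far as I know, iText 5 can supply these through a RenderListener, whose renderText callback receives a TextRenderInfo with getText() and getBaseline(). The Chunk class, the column boundaries, and the tolerance are all hypothetical and would need to be measured from the real document. The sketch also assumes that a line with an empty first column continues the previous row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TableGrouper {
    // Hypothetical holder: a piece of text and where its baseline starts.
    static final class Chunk {
        final float x, y;
        final String text;
        Chunk(float x, float y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    // columnStarts: assumed left edge of each column, in ascending order.
    // yTolerance: chunks closer than this vertically count as one line.
    static List<String[]> toRows(List<Chunk> chunks, float[] columnStarts, float yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so the top of the page comes first.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));
        List<String[]> rows = new ArrayList<>();
        List<Chunk> line = new ArrayList<>();
        for (Chunk c : sorted) {
            if (!line.isEmpty() && line.get(0).y - c.y > yTolerance) {
                addLine(rows, line, columnStarts);
                line.clear();
            }
            line.add(c);
        }
        if (!line.isEmpty()) {
            addLine(rows, line, columnStarts);
        }
        return rows;
    }

    private static void addLine(List<String[]> rows, List<Chunk> line, float[] columnStarts) {
        line.sort(Comparator.comparingDouble(c -> c.x)); // left to right
        String[] cells = new String[columnStarts.length];
        Arrays.fill(cells, "");
        for (Chunk c : line) {
            int col = columnOf(c.x, columnStarts);
            cells[col] = join(cells[col], c.text);
        }
        if (cells[0].isEmpty() && !rows.isEmpty()) {
            // Assumption: no value in the first column means a wrapped cell,
            // so the text is appended to the previous row.
            String[] prev = rows.get(rows.size() - 1);
            for (int i = 0; i < cells.length; i++) {
                prev[i] = join(prev[i], cells[i]);
            }
        } else {
            rows.add(cells);
        }
    }

    private static String join(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a + " " + b;
    }

    // Index of the last column whose left edge is at or before x.
    private static int columnOf(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) col = i;
        }
        return col;
    }
}
```

Each String[] in the result is one table row, which matches the "one array per row" layout asked for in the question. Fixed column boundaries are the simplest choice here. If the columns shift from page to page, they would have to be detected from the header row instead.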
