How to extract the contents of a table into a pdf file?
I want to extract the contents of a table in pdf format, for example:
I wrote this java program using iText java PDF libray which can read the contents of a PDF file line by line but I dont know how to get the contents of a table
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
public class PDFReader {
public static void main(String[] args) {
// TODO, add your application code
System.out.println("Lecteur PDF");
System.out.println (ReadPDF("D:/test.pdf"));
}
private static String ReadPDF(String pdf_url)
{
StringBuilder str=new StringBuilder();
try
{
PdfReader reader = new PdfReader(pdf_url);
int n = reader.getNumberOfPages();
for(int i=1;i<n;i++)
{
String str2=PdfTextExtractor.getTextFromPage(reader, i);
str.append(str2);
System.out.println(str);
}
}catch(Exception err)
{
err.printStackTrace();
}
return String.format("%s", str);
}
}
this is what i get:
but this is not what I want, I want to extract the contents of a table row by row and column by column, for example store each row in a java array
the first array will contain: "N °", "DATE OBERVATIONS", "TEXTE"
the second array will contain: "029/14", "Le 1er sept 2014 remune AVURNAV ...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav ..."
the third array will contain: "037/14", "Le 15 October 2014 remune AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."
etc.
thank
source to share
You may need to identify common field start / end character sequences to split your data into an array if your PDF library does not support table extraction. For example, the first fields nnn/nn
, the second field ends nnnn/nn
, and the third field ends where the next first field begins.
This is a tricky issue - I've had to use coordinate based approaches to deal with this before, but your pdf library may not support extracting letter positions as well as the actual text.
source to share