Converting data from PDFform to CSV

Question

I am trying to convert data entered in multiple fillable PDF forms into one CSV file.
The code consists of several steps:

Open a new CSV file (header line)
Open multiple PDF forms with a "for ... in" loop
Convert data entered in form fields to csv

However, when I run the script, I get this error:

fc-int01-generateAppearances: None
Traceback (most recent call last):
    File "C:\Python27\Scripts\test3.py", line 31, in <module>
        writer.writerow(value)
    _csv.Error: sequence expected

If I just print the value (form data) in Python, it works, but nothing is written to the CSV file. The problem may be in how the values are turned into a row. I hope someone can help.

Here is my code:

import glob
import os
import sys
import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

#input file path for specific file
#filename = "C:\Python27\Scripts\MH_1.pdf"
#fp = open(filename, 'rb')

#open new csv file
out_file=open('C:\Users\Wonen\Downloads\Test\output.csv', 'w+')
writer = csv.writer(out_file)
#header row
writer.writerow(('Name coordinator', 'Date', 'Address', 'District',
                 'City', 'Complaintnr'))

#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print '{0}: {1}'.format(name, value)
        writer.writerow(value)

Output for one PDF (all values), printed with print(repr(value)):

None
'Crip Gang'
None
None
None
/Ja
None
/1
/1
None
None
/Ja
/Ja
None
None
None
'wfwf'
'sd'
'dfwf'
'ffasf'
'tsdbd'
'dfadfasdf'
None
'df'
None
'asdff'
None
'wff'
None
'ffs'
None
None
None
None
None
None
None
None
None
None
None
'1'
'2'
'7'
/0
'Ja'
'Two unlimited'
'Captain Jack'
None
'www.kijkbijmij.nl'
'Onderverhuur'
/Ja

and so on. "None" means an empty text box, and "1" and "0" stand for "yes" and "no".

python python-2.7 pdf csv pdf-form

Readazoid Jul 20. 15 at 16:15

1 answer

martineau · Accepted Answer · 2015-07-22T09:25:38+0000

Try changing the last part of the code as shown below:

    .
    .
    .
#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    row = []
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        row.append(value)
    writer.writerow(row)

out_file.close()

I cannot verify that this will work, but it may give you the information you need to solve your problem.
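For context on the error itself: csv.writer.writerow expects a sequence holding the cell values of one row, so passing a single field value that happens to be None raises csv.Error (reported as "sequence expected" on Python 2.7; newer versions word the message differently). The following is a minimal sketch of that behaviour, assuming Python 3 and using an in-memory buffer in place of the real output file, with a few sample values taken from the question's output:

```python
import csv
import io

buf = io.StringIO()  # in-memory stand-in for the output file (Python 3)
writer = csv.writer(buf)

# Passing a bare None instead of a sequence is rejected by the csv module.
try:
    writer.writerow(None)
    error_raised = False
except csv.Error:
    error_raised = True

# Collecting the values into a list first produces one CSV row.
row = []
for value in ['Crip Gang', None, 'wfwf']:  # sample values from the question
    row.append(value)
writer.writerow(row)  # None inside a row is written as an empty cell

csv_text = buf.getvalue()
```

Note that a plain string is also a sequence, so writer.writerow('wfwf') would not fail but would write each character into its own column, which is probably not what is wanted either.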

One thing to keep in mind is the first line written to the CSV, the header row:

writer.writerow(('Name coordinator', 'Date', 'Address','District','City', 'Complaintnr'))

This determines how many field values each written line should contain. It means every row you write must be a list of data for those 6 elements, in that order.

You need to figure out how to translate what's in each group of fields into a list, row, of 6 data items. This is what the code in my answer does, I think, but I cannot verify it.
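One possible way to do that translation is sketched below, under stated assumptions. The names in WANTED_FIELDS are hypothetical placeholders for whatever the forms' 'T' entries are actually called, and clean and build_row are illustrative helpers, not part of pdfminer. Checkbox values such as /Ja are assumed to be name objects that expose their text in a .name attribute; FakeName only imitates that for the demonstration.

```python
# Hypothetical field names, listed in the same order as the header row.
WANTED_FIELDS = ['coordinator', 'date', 'address',
                 'district', 'city', 'complaintnr']


def clean(value):
    """Turn one raw field value into plain text for a CSV cell."""
    if value is None:
        return ''  # empty text box
    # Assumption: name objects like /Ja carry their text in `.name`;
    # ordinary strings have no such attribute and pass through unchanged.
    return getattr(value, 'name', value)


def build_row(pairs):
    """Pick the wanted fields out of (name, value) pairs, in header order."""
    by_name = dict(pairs)
    return [clean(by_name.get(name)) for name in WANTED_FIELDS]


class FakeName(object):
    """Stand-in for a PDF name object, used only in this demonstration."""
    def __init__(self, name):
        self.name = name


# Made-up (name, value) pairs in arbitrary order, including an extra field.
pairs = [('city', 'wfwf'), ('coordinator', 'Crip Gang'),
         ('unrelated', FakeName('Ja')), ('district', FakeName('Ja')),
         ('date', None)]
row = build_row(pairs)
```

In the loop from the answer, pairs would be the (name, value) tuples collected from each field, and writer.writerow(build_row(pairs)) would take the place of writer.writerow(row). Fields that are missing from a form end up as empty cells, so every row keeps the same 6 columns as the header.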
