Converting data from PDF forms to CSV
I am trying to convert data entered in multiple fillable PDF forms into one CSV file.
The code consists of several steps:
- Open a new CSV file (header line)
- Open multiple PDF forms with a "for ... in" loop
- Convert the data entered in the form fields to CSV
However, when I run the script, I get this error:
fc-int01-generateAppearances: None
Traceback (most recent call last):
  File "C:\Python27\Scripts\test3.py", line 31, in <module>
    writer.writerow(value)
_csv.Error: sequence expected
If I just print the value (the form data) in Python, it works, but no data is written to the CSV file. The problem may be in how I go from a single value to a row. I hope someone can help.
Here is my code:
import glob
import os
import sys
import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
#input file path for specific file
#filename = "C:\Python27\Scripts\MH_1.pdf"
#fp = open(filename, 'rb')
#open new csv file
out_file=open('C:\Users\Wonen\Downloads\Test\output.csv', 'w+')
writer = csv.writer(out_file)
#header row
writer.writerow(('Name coordinator', 'Date', 'Address', 'District',
'City', 'Complaintnr'))
#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize() # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print '{0}: {1}'.format(name, value)
        writer.writerow(value)
Output of print (repr(value)) for one PDF (all fields):
None 'Crip Gang' None None None /Ja None /1 /1 None None /Ja /Ja None None None 'wfwf' 'sd' 'dfwf' 'ffasf' 'tsdbd' 'dfadfasdf' None 'df' None 'asdff' None 'wff' None 'ffs' None None None None None None None None None None None '1' '2' '7' /0 'Ja' 'Two unlimited' 'Captain Jack' None 'www.kijkbijmij.nl' 'Onderverhuur' /Ja
and so on. "None" means an empty text box, and "1" and "0" stand for the answers "yes" and "no".
Try changing the last part of the code as shown below:
.
.
.
#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize() # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    #collect the values of one form into a single row
    row = []
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        row.append(value)
    #write one csv line per pdf
    writer.writerow(row)
out_file.close()
I am not sure this will work as-is, but it may give you the information you need to solve your problem.
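The reason for the change can be shown in isolation. The following is a minimal sketch, assuming Python 3 and an in-memory buffer instead of the real output file: writerow expects a whole row (a list or tuple of cells), so passing a single value such as None fails, while a list that merely contains None is written with an empty cell. The sample values are taken from the output in the question.

```python
import csv
import io

# In-memory stand-in for the real output file (assumption for this sketch).
buf = io.StringIO()
writer = csv.writer(buf)

# A single value is not a row: passing None raises csv.Error.
try:
    writer.writerow(None)
    failed = False
except csv.Error:
    failed = True

# A list is a row; a None entry inside it becomes an empty cell.
writer.writerow([None, 'Crip Gang', '1'])
output = buf.getvalue()
```

Note that a plain string would not raise an error either, but it would be split into one cell per character, which is rarely what is wanted.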
One confusing thing is the first line written to the CSV file, the header:
writer.writerow(('Name coordinator', 'Date', 'Address','District','City', 'Complaintnr'))
It determines how many field values each written line should contain. This means that fields must yield a list of data for those 6 elements, in that order.
You need to figure out how to translate what's in each group of fields into a list, row, of 6 data items. This is what the code in my answer does, I think, but I cannot verify it.