How to parse a tab-delimited file that can have values across multiple lines?

Question


I have a tab-delimited file with different data points:

"ID"    "Value"
"1" "This is a value"

I can easily extract data from this just by using the built-in str method split. However, there are times when I run into this:

"ID"    "Value"
"1" "This is a value"
"2" "This is another
value"
"3" "Just one more"

Where the second value runs over multiple lines. How can I capture each data point in its entirety?

Ultimately what I want is a list of dictionaries like this:

[{'ID':'1', 'Value':'This is a value'}, {'ID':'2', 'Value':'This is another\nvalue'}, {'ID':'3', 'Value':'Just one more'}]


python parsing multiline

KronoS 06 Aug 14 at 12:16 am


2 answers

When iterating line by line, there are two cases. In the default case, you are reading a new record, so you handle it just as you would without multiline values. In the other case, the previous line did not finish its record, i.e. it did not end with a quote, so the current line is a continuation of the previous entry. To parse the file, you therefore only need to keep track of whether the previous record is complete.

Something like this:

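is_new = True
records = []
for line in file:  # file: an already opened file object
    line = line.rstrip('\r\n')  # drop the line ending so the quote check below works
    if is_new:
        # start of a new record
        records.append(line.split('\t'))
    else:
        # continuation of the last field of the previous record
        records[-1][-1] += '\n' + line
    # a record is complete once its last field ends with a closing quote
    is_new = records[-1][-1].endswith('"')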
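
To get from these raw records to the list of dictionaries asked for in the question, the surrounding quotes still have to be stripped and the header row paired with each data row. Below is a minimal sketch of one way to do that, assuming every field is wrapped in double quotes and contains no embedded tabs or escaped quotes. The function name parse_records and the sample list are made up for illustration and do not come from the original answer.

```python
def parse_records(lines):
    """Group physical lines into records, then build one dict per record."""
    records = []
    is_new = True
    for line in lines:
        line = line.rstrip('\r\n')
        if is_new:
            # start of a new record
            records.append(line.split('\t'))
        else:
            # continuation of the last field of the previous record
            records[-1][-1] += '\n' + line
        is_new = records[-1][-1].endswith('"')
    # Drop the surrounding quotes (assumption: every field is quoted).
    cleaned = [[field[1:-1] for field in record] for record in records]
    header, rows = cleaned[0], cleaned[1:]
    return [dict(zip(header, row)) for row in rows]


# Hypothetical input mirroring the file from the question.
sample = [
    '"ID"\t"Value"\n',
    '"1"\t"This is a value"\n',
    '"2"\t"This is another\n',
    'value"\n',
    '"3"\t"Just one more"\n',
]
result = parse_records(sample)
```

The sketch fails on an empty input and on fields that legitimately end a line with a quote character in the middle of a value, so it is only suitable for well-behaved files like the one shown.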


poke 06 Aug 14 at 12:24 am


spicavigo · Accepted Answer · 2014-08-06T00:29:08+0000

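import csv

# newline="" lets the csv module handle line breaks inside quoted fields
with open("a.tsv", newline="") as f:
    r = csv.reader(f, delimiter="\t", quotechar='"')
    print(next(r))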

A sample execution is available at http://codebunk.com/b/4095452/
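
Since the question asks for a list of dictionaries, the same csv-based approach can probably be taken one step further with csv.DictReader, which uses the first row as keys. The following is a sketch rather than part of the original answer: the in-memory data string stands in for the file from the question, and the variable names are illustrative.

```python
import csv
import io

# In-memory stand-in for the file from the question; with a real file,
# open it with newline="" so the csv module sees the raw line breaks.
data = io.StringIO(
    '"ID"\t"Value"\n'
    '"1"\t"This is a value"\n'
    '"2"\t"This is another\nvalue"\n'
    '"3"\t"Just one more"\n',
    newline="",
)

# DictReader takes the header row as keys and keeps the line break
# inside the quoted multi-line field.
reader = csv.DictReader(data, delimiter="\t", quotechar='"')
rows = [dict(row) for row in reader]
```

Letting the csv module do the quoting logic avoids hand-written state tracking and also copes with quoted fields that contain tabs.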
