How to parse a tab-delimited file that can have values across multiple lines?

I have a tab-delimited file with different data points:

"ID"    "Value"
"1" "This is a value"

      

I can easily extract data from this just by using the built-in str method split. However, there are times when I run into this:

"ID"    "Value"
"1" "This is a value"
"2" "This is another
value"
"3" "Just one more"

      

Here the second value runs over multiple lines. How can I capture each data point in its entirety?

Ultimately what I want is a list of dictionaries like this:

[{'ID':'1', 'Value':'This is a value'}, {'ID':'2', 'Value':'This is another\nvalue'}, {'ID':'3', 'Value':'Just one more'}]

      



2 answers


import csv
# csv.reader handles quoted fields that span multiple lines
with open("a.tsv", newline="") as f:
    r = csv.reader(f, delimiter="\t", quotechar='"')
    print(next(r))

      



A sample run is available at http://codebunk.com/b/4095452/
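To get the list of dictionaries asked for in the question, csv.DictReader can take the place of csv.reader. The following is a minimal sketch, not a definitive implementation: SAMPLE and read_records are hypothetical names introduced here, and the sample assumes real tab characters between the fields.

```python
import csv
import io

# Sample data from the question (assumption: fields are separated by real tabs).
SAMPLE = (
    '"ID"\t"Value"\n'
    '"1"\t"This is a value"\n'
    '"2"\t"This is another\nvalue"\n'
    '"3"\t"Just one more"\n'
)

def read_records(stream):
    """Return one dict per record, keyed by the header row."""
    # The csv module keeps reading lines until a quoted field is closed,
    # so embedded newlines end up inside the field value.
    reader = csv.DictReader(stream, delimiter="\t", quotechar='"')
    return [dict(row) for row in reader]

records = read_records(io.StringIO(SAMPLE))
print(records)
```

For a real file, the same function should work with `with open("a.tsv", newline="") as f: records = read_records(f)`; passing newline="" lets the csv module handle line endings inside quoted fields itself.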



When iterating line by line, there are two possibilities. In the default case, you are reading a new record, so you handle it just as you would without multiline values. In the other case, the previous line did not finish its record, i.e. it did not end with a quote, so you are still adding to the previous entry. Therefore, you only need to track whether the previous record is complete in order to parse your file.

Something like this:



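is_new = True
records = []
with open("a.tsv") as f:
    for line in f:
        line = line.rstrip("\n")  # drop the newline so endswith('"') can match
        if is_new:
            # Start a new record
            records.append(line.split("\t"))
        else:
            # Continuation: append to the last field of the previous record
            records[-1][-1] += "\n" + line
        # A record is complete once its last field ends with a closing quote
        is_new = records[-1][-1].endswith('"')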
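The same state-tracking idea can be wrapped in a function that also strips the quotes and builds the dictionaries from the question. This is only a sketch under a few assumptions: parse_lines and SAMPLE are hypothetical names, every field is quoted, and a continuation line never ends with a quote before the field actually closes.

```python
def parse_lines(lines):
    """Group physical lines into records and return a list of dicts."""
    rows = []
    is_new = True
    for line in lines:
        line = line.rstrip("\r\n")
        if is_new and not line:
            continue  # skip blank lines between records
        if is_new:
            rows.append(line.split("\t"))
        else:
            # Still inside a quoted field: extend the last field.
            rows[-1][-1] += "\n" + line
        # The record is finished once its last field ends with a quote.
        is_new = rows[-1][-1].endswith('"')
    # Remove the surrounding quotes, then pair each row with the header.
    cleaned = [[field.strip('"') for field in row] for row in rows]
    header, *data = cleaned
    return [dict(zip(header, row)) for row in data]

# Sample data from the question (assumption: real tabs between fields).
SAMPLE = '"ID"\t"Value"\n"1"\t"This is a value"\n"2"\t"This is another\nvalue"\n"3"\t"Just one more"\n'
result = parse_lines(SAMPLE.splitlines(keepends=True))
print(result)
```

Unlike the csv module, this approach does not handle escaped quotes or tabs inside a multiline field, so it is best suited to files whose layout you control.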

      
