Python csv newline character in field

I am having trouble reading a thorn-delimited (þ) CSV file that I think has a newline character inside one of its fields. The newline breaks the record across two lines, so I can't read the values in the last fields of that record. I tried opening the file in universal newline mode, but I'm not sure what the best way to handle this is.

This is how I am reading the file in Python:

csv.register_dialect('BB', delimiter='\xfe')
with open(file, 'rU') as file_in: 
    log=csv.reader(file_in, dialect='BB')
    for row in log:
        print row

      

This works great for most of the file, but there is one line that I assume has a newline character in one of its fields; I'm not sure how best to diagnose it. Below is a screenshot of the line in Notepad: the record is forced onto two lines, when it should look like the two lines beneath it.

[screenshot of the file in Notepad]

When this is read with csv.reader, the line looks like this:

['06 -13-2015-10: 13: 41 ',' 0 ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' 142 ',' ',' 5 ',' 7.0 ',' 2 ',' ',' cmhkl966 ',' amex_674 ',' 1 ',' 0.00 ',' ',' ',' '"]

i.e. the row is truncated at that first quote character.



2 answers


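I have reduced your problem to a minimal example (I hope I understood the cause correctly). In this example the field that contains the newline is wrapped in single quotes, so passing that character as quotechar lets csv.reader keep the field in one piece: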

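import io
import csv

# In-memory stand-in for the file; the trailing backslash avoids a leading blank row.
file_in = io.StringIO('''\
aþbþ'hello
world'
''')

# Declaring ' as the quote character keeps the embedded newline inside one field.
log = csv.reader(file_in, delimiter='\xfe', quotechar="'")
for row in log:
    print(row)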

      

output:

['a', 'b', 'hello\nworld']
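
For contrast, here is a minimal sketch of what happens to the same data when quotechar is left at its default of ". The helper name parse and the constant names are made up for this illustration, and the sample assumes, as above, that the field is wrapped in single quotes.

```python
import csv
import io

DELIM = '\xfe'  # the thorn character used as the field delimiter

# One logical record whose last field is single-quoted and contains a newline.
SAMPLE = DELIM.join(["a", "b", "'hello\nworld'"]) + "\n"


def parse(text, **fmt):
    """Parse text with the thorn delimiter plus any extra csv format options."""
    return list(csv.reader(io.StringIO(text, newline=''), delimiter=DELIM, **fmt))


# Default quotechar is ", so ' is ordinary data and the record splits in two.
default_rows = parse(SAMPLE)
# With quotechar="'" the newline stays inside the quoted field.
quoted_rows = parse(SAMPLE, quotechar="'")

print(default_rows)
print(quoted_rows)
```

The first call yields two short rows, which matches the symptom described in the question; the second yields a single three-field row.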

      


UPDATE: as pointed out in the comments, here is a version that reads the data from a .csv file. Contents of test.csv:

aþbþ'hello
world'þc
dþeþ'hello
other
things'þf
gþhþiþj

      

and python code:

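import csv
from pathlib import Path

HERE = Path(__file__).parent
DATA_PATH = HERE / '../data/test.csv'

# newline='' is the mode the csv module recommends; the old 'U' flag was removed in Python 3.11.
# Pass encoding=... as well if the file is not in the platform's default encoding.
with DATA_PATH.open(newline='') as file_in:
    log = csv.reader(file_in, delimiter='\xfe', quotechar="'")
    for row in log:
        print(row)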

      

which outputs:

['a', 'b', 'hello\nworld', 'c']
['d', 'e', 'hello\nother\nthings', 'f']
['g', 'h', 'i', 'j']
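
On the question of how to diagnose which lines are affected, one possible approach is to parse the file without any quote handling and flag rows whose field count differs from the most common one. This is only a sketch: find_odd_rows and the sample data are hypothetical, and it assumes that most records in the file are intact.

```python
import csv
import io
from collections import Counter

DELIM = '\xfe'  # thorn delimiter, as in the question


def find_odd_rows(file_obj, delimiter=DELIM):
    """Return (line_number, field_count) for rows with an unusual field count."""
    reader = csv.reader(file_obj, delimiter=delimiter)
    # reader.line_num is the number of source lines read so far.
    widths = [(reader.line_num, len(row)) for row in reader]
    if not widths:
        return []
    # Treat the most frequent field count as the expected record width.
    expected = Counter(width for _, width in widths).most_common(1)[0][0]
    return [(line, width) for line, width in widths if width != expected]


# Made-up sample: the second record is broken across lines 2 and 3.
sample = "\n".join([
    DELIM.join(["a", "b", "c", "d"]),
    DELIM.join(["e", "f", "'hello"]),
    DELIM.join(["world'", "g"]),
    DELIM.join(["h", "i", "j", "k"]),
    DELIM.join(["l", "m", "n", "o"]),
]) + "\n"

odd_rows = find_odd_rows(io.StringIO(sample, newline=''))
print(odd_rows)
```

Printing repr() of the flagged lines from the raw file should then show which quote character, if any, surrounds the broken field.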

      



You can also check whether the first element of the next row starts with a timestamp and, if it does not, use the list method extend to append that row to the current one before printing.

Disclaimer: not tested.



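import csv
import re

csv.register_dialect('BB', delimiter='\xfe')
timestamp = re.compile(r'\d+-\d+-\d+-\d+:\d+:\d+')

# file is the path from the question
with open(file, newline='') as file_in:
    rows = []
    for row in csv.reader(file_in, dialect='BB'):
        # A row that does not start with a timestamp continues the previous row.
        if rows and (not row or timestamp.match(row[0]) is None):
            rows[-1].extend(row)
        else:
            rows.append(row)

for row in rows:
    print(row)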
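
The timestamp check described above can also be sketched as a small function and tried on in-memory data. The function name merge_continuations and the sample rows are invented for this illustration; the only assumption taken from the question is that every real record starts with a timestamp such as 06-13-2015-10:13:41.

```python
import csv
import io
import re

DELIM = '\xfe'  # thorn delimiter, as in the question
TIMESTAMP = re.compile(r'\d+-\d+-\d+-\d+:\d+:\d+')


def merge_continuations(rows):
    """Append rows that do not start with a timestamp to the preceding row."""
    merged = []
    for row in rows:
        if merged and (not row or TIMESTAMP.match(row[0]) is None):
            merged[-1].extend(row)
        else:
            merged.append(list(row))
    return merged


# Made-up sample: the first record is broken in the middle of its third field.
sample = "\n".join([
    DELIM.join(["06-13-2015-10:13:41", "0", "first half"]),
    DELIM.join(["second half", "1"]),
    DELIM.join(["06-13-2015-10:14:02", "0", "ok", "2"]),
]) + "\n"

reader = csv.reader(io.StringIO(sample, newline=''), delimiter=DELIM)
result = merge_continuations(reader)
for row in result:
    print(row)
```

Note that extend leaves the broken field as two separate list items ("first half" and "second half"), so the merged row has one more field than an intact record. If the field should be restored, the last item of the previous row and the first item of the continuation would need to be joined instead.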

      
