How to read a text file where some of the content has line breaks?

Question

I have a text file of this form:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

You can see that the records are separated by line breaks, but some records also contain line breaks within their content. Thus, simple line splitting does not parse each record properly.

As an example, for the 5th entry, I want my output to be 07/01/2016, 6:14 pm - abcde fghe

Here is my current code:

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)

+3

python

Imran May 07 '17 at 18:47

4 answers

Given that ',' can only appear as a separator, we can check whether a line contains a comma and, if it does not, concatenate it with the previous record:

data = []

with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row:
            data[-1] += '\n' + row
        else:
            data.append(row)
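A minimal sketch of the same idea wrapped in a function, in case it helps. The name merge_by_comma and the joiner parameter are made up for illustration. It additionally skips blank lines, joins continuation lines with a space (the form the question asks for), and does not fail if the very first line has no comma. It still assumes that a comma never appears inside the message text.

```python
def merge_by_comma(lines, joiner=" "):
    """Group lines into records; a line without ',' continues the previous record."""
    records = []
    for line in lines:
        row = line.strip()
        if not row:
            continue  # skip blank lines entirely
        if "," in row or not records:
            records.append(row)  # new record (or the first line of the input)
        else:
            records[-1] += joiner + row  # continuation of the previous record
    return records


sample = [
    "07/01/2016, 6:14 pm - abcde\n",
    "\n",
    "fghe\n",
    "07/01/2016, 6:20 pm - abcde\n",
]
print(merge_by_comma(sample))
# ['07/01/2016, 6:14 pm - abcde fghe', '07/01/2016, 6:20 pm - abcde']
```

Any iterable of lines works, so an open file object can be passed directly.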

+1

Hugo sadok 07 May '17 at 19:00

You can use regular expressions (the re module) to check for dates:

import re
with open('file.txt', 'r') as text_file:
  data = []
  for line in text_file:
    row = line.strip()
    if re.match(r'\d{2}/\d{2}/\d{4}', row):  # re.match needs the string to test
      data.append(row)  # date: new record
    else:
      data[-1] += '\n' + row  # no date: append to last record

# '\d{2}': two digits
# '.*': any character, zero or more times
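For reference, here is a small self-contained sketch of the date-check approach that can be run against sample text. The names DATE_AT_START and group_records are hypothetical, and the pattern assumes every record starts with dd/mm/yyyy followed by a comma and a space, as in the question. Unlike the comma check, continuation lines may contain commas here.

```python
import re

# Assumed record prefix: dd/mm/yyyy followed by ", " (taken from the sample data).
DATE_AT_START = re.compile(r"\d{2}/\d{2}/\d{4}, ")


def group_records(lines):
    """Start a new record at each dated line; attach other non-blank lines to it."""
    records = []
    for line in lines:
        row = line.strip()
        if DATE_AT_START.match(row):
            records.append([row])      # date: new record
        elif row and records:
            records[-1].append(row)    # no date: part of the last record
    return [" ".join(parts) for parts in records]


text = "07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl, mnop\n07/01/2016, 7:58 pm - abcde"
print(group_records(text.splitlines()))
# ['07/01/2016, 7:58 pm - abcde fghe ijkl, mnop', '07/01/2016, 7:58 pm - abcde']
```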

0

schwobaseggl May 07 '17 at 18:58

Simple length testing:

#!python3
#coding=utf-8

data = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

lines = data.split("\n")
out = []
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c
        else:
            out.append(c)
    #skip empty

for o in out:
    print(o)

leads to:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcdefghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcdefgheijkl
07/01/2016, 7:58 pm - abcde

Note that this output no longer contains the line breaks that were inside the data!

This one-line regex should do it instead: it splits at every line break that is followed by a digit. That works at least for the sample data, but it breaks when a line inside a record starts with a digit:

#!python3
#coding=utf-8

text_file = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

import re
data = re.split(r"\n(?=\d)", text_file)  # raw string avoids the invalid-escape warning for \d

print(data)

for d in data:
    print(d)

Output:

['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

(fixed by using a lookahead)
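To illustrate the failure case mentioned above, here is a short sketch comparing the digit lookahead with a stricter one. The stricter pattern is my own assumption based on the sample format (dd/mm/yyyy followed by a comma and a space); it only splits before lines that look like the start of a record.

```python
import re

# The second line is message content that happens to start with a digit.
text = "07/01/2016, 6:14 pm - abcde\n2 apples\n07/01/2016, 6:20 pm - abcde"

loose = re.split(r"\n(?=\d)", text)                    # splits before any digit
strict = re.split(r"\n(?=\d{2}/\d{2}/\d{4}, )", text)  # splits only before a date

print(loose)
# ['07/01/2016, 6:14 pm - abcde', '2 apples', '07/01/2016, 6:20 pm - abcde']
print(strict)
# ['07/01/2016, 6:14 pm - abcde\n2 apples', '07/01/2016, 6:20 pm - abcde']
```

The stricter version can still be fooled by message content that starts a line with a full date, but that is much less likely than a line starting with a single digit.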

0

handle May 07 '17 at 19:14

dawg · Accepted Answer · 2017-05-07T19:16:45+0000

Given your example input, you can use a regex with a lookahead:

import re
from pprint import pprint

fn = 'file.txt'  # path to the input file
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

with open(fn) as f:
    pprint([m.group(1) for m in pat.finditer(f.read())])

This prints:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

With the Dropbox example, it prints:

['11/11/2015, 3:16 pm - IK: 12\n',
 '13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
 '13/11/2015, 12:11 pm - IK: Boo\n',
 '15/11/2015, 8:36 pm - IR: Root\n',
 '15/11/2015, 8:36 pm - IR: LaTeX?\n',
 '15/11/2015, 8:43 pm - IK: Ws\n']

If you want to remove the \n characters from what has been captured, just use m.group(1).strip().replace('\n', '') in place of m.group(1) in the list comprehension above.
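A quick sketch of that clean-up step on an in-memory string, with one variation that is my own suggestion rather than part of the answer: replace('\n', '') glues the pieces together without a space, whereas ' '.join(s.split()) collapses every run of whitespace into a single space, which gives the "abcde fghe" form the question asks for. The pattern below is a slightly simplified spelling of the one above (unescaped slashes, a single ^ in the lookahead) and should match the same text.

```python
import re

pat = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)
text = "07/01/2016, 6:14 pm - abcde\n\nfghe\n07/01/2016, 6:20 pm - abcde\n"

# Newlines removed outright: continuation text is glued to the record.
removed = [m.group(1).strip().replace('\n', '') for m in pat.finditer(text)]
# Whitespace runs collapsed to one space: continuation text stays separated.
spaced = [' '.join(m.group(1).split()) for m in pat.finditer(text)]

print(removed)
# ['07/01/2016, 6:14 pm - abcdefghe', '07/01/2016, 6:20 pm - abcde']
print(spaced)
# ['07/01/2016, 6:14 pm - abcde fghe', '07/01/2016, 6:20 pm - abcde']
```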

Explanation of regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)

^                                                       start of line   
    ^  ^  ^  ^   ^                                      pattern for a date  
                       ^                                capture the rest...  
                           ^                            until (look ahead)
                                      ^ ^ ^             another date
                                                  ^     or
                                                     ^  end of string
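The same breakdown can also be written into the pattern itself with re.X (verbose mode), which ignores whitespace in the pattern and allows comments. This is only a sketch of an equivalent spelling, not the exact pattern from the answer: the slashes are left unescaped and the lookahead uses a single ^.

```python
import re

pat = re.compile(r"""
    ^                        # start of a line (because of re.M)
    (                        # group 1: one whole record
      \d\d/\d\d/\d\d\d\d     #   a date such as 07/01/2016
      .*?                    #   the rest, as little as possible; re.S lets . match newlines
    )
    (?=                      # lookahead: stop right before, without consuming...
      ^\d\d/\d\d/\d\d\d\d    #   ...the next line that starts with a date
      | \Z                   #   ...or the end of the string
    )
""", re.S | re.M | re.X)

text = (
    "06/01/2016, 10:40 pm - abcde\n"
    "07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n"
    "07/01/2016, 7:58 pm - abcde\n"
)
records = [m.group(1) for m in pat.finditer(text)]
print(records)
# ['06/01/2016, 10:40 pm - abcde\n',
#  '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
#  '07/01/2016, 7:58 pm - abcde\n']
```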
