How to read a text file where some of the content has line breaks?
I have a text file of this form:
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde
You can see that each record is separated by a line break, but some records also contain line breaks within their content. Thus, simply splitting on line breaks does not parse each record properly.
As an example, for the 5th entry, I want my output to be 07/01/2016, 6:14 pm - abcde fghe
Here is my current code:
with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)
Given your example input, you can use a lookahead regex:
import re
from pprint import pprint

pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
with open(fn) as f:  # fn is the path to the text file
    pprint([m.group(1) for m in pat.finditer(f.read())])
This prints:
['06/01/2016, 10:40 pm - abcde\n',
'07/01/2016, 12:04 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 6:14 pm - abcde\n\nfghe\n',
'07/01/2016, 6:20 pm - abcde\n',
'07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
'07/01/2016, 7:58 pm - abcde\n']
With the Dropbox example, it prints:
['11/11/2015, 3:16 pm - IK: 12\n',
'13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
'13/11/2015, 12:11 pm - IK: Boo\n',
'15/11/2015, 8:36 pm - IR: Root\n',
'15/11/2015, 8:36 pm - IR: LaTeX?\n',
'15/11/2015, 8:43 pm - IK: Ws\n']
If you want to remove the \n characters from what has been captured, just use m.group(1).strip().replace('\n', '') in the list comprehension above.
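Note that replace('\n', '') glues continuation lines together without a space (abcdefghe). Below is a minimal sketch, assuming the desired output is space-separated as in the question (abcde fghe). The helper name normalize_record and the inline sample string are illustrative only and not part of the answer above.

```python
import re

# Same date-anchored pattern as above: a record runs from one date at the
# start of a line up to the next such date (or the end of the string).
pat = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)

def normalize_record(record):
    # Hypothetical helper: collapse every run of whitespace (including
    # line breaks) into a single space.
    return ' '.join(record.split())

sample = "07/01/2016, 6:14 pm - abcde\n\nfghe\n07/01/2016, 6:20 pm - abcde\n"
records = [normalize_record(m.group(1)) for m in pat.finditer(sample)]
print(records)
```

Because str.split() with no arguments splits on any whitespace run, this also removes the trailing newline, so no separate strip() call is needed.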
Explanation of regex:
^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^\d\d\/\d\d\/\d\d\d\d|\Z)

^                        start of line
\d\d\/\d\d\/\d\d\d\d     pattern for a date
.*?                      capture the rest...
(?=...)                  until (lookahead)
^\d\d\/\d\d\/\d\d\d\d    another date
|                        or
\Z                       end of string
Assuming that ',' can only appear as the separator after the date, we can check whether there is a comma on the line and, if there isn't, concatenate the line with the previous record:
data = []
with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row:
            data[-1] += '\n' + row
        else:
            data.append(row)
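The loop above raises an IndexError if the very first line has no comma, and it appends an empty continuation for every blank line. Here is a sketch with both cases guarded, assuming blank lines carry no content and can be skipped. The function name merge_on_comma and the choice to accept any iterable of lines are illustrative, not part of the answer.

```python
def merge_on_comma(lines):
    """Group lines into records; a line containing ',' starts a new record."""
    data = []
    for line in lines:
        row = line.strip()
        if not row:
            continue  # assumption: blank lines carry no content
        if ',' in row or not data:
            data.append(row)          # new record (or the very first line)
        else:
            data[-1] += '\n' + row    # continuation of the previous record
    return data

sample = [
    "07/01/2016, 6:14 pm - abcde\n",
    "\n",
    "fghe\n",
    "07/01/2016, 6:20 pm - abcde\n",
]
print(merge_on_comma(sample))
```

An open file object can be passed directly, since iterating over it yields lines. The comma heuristic still misfires if a continuation line itself contains a comma.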
You can use regular expressions (using the re module) to check for dates:
import re

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        if re.match(r'\d{2}/\d{2}/\d{4}.*', row):
            data.append(row)  # date: new record
        else:
            data[-1] += '\n' + row  # no date: append to last record
# '\d{2}': two digits
# '.*': any character, zero or more times
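For large files it may be preferable not to build the whole list at once. The following is a sketch of the same date check written as a generator, assuming every record starts with a dd/mm/yyyy date and that blank lines can be dropped. The names DATE_RE and iter_records are hypothetical.

```python
import re

DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}')  # assumed record prefix: dd/mm/yyyy

def iter_records(lines):
    """Yield one record at a time; a line starting with a date opens a new record."""
    current = None
    for line in lines:
        row = line.strip()
        if DATE_RE.match(row):
            if current is not None:
                yield current         # previous record is complete
            current = row
        elif row and current is not None:
            current += '\n' + row     # non-empty line without a date: continuation
    if current is not None:
        yield current                 # last record

sample = ["06/01/2016, 10:40 pm - abcde\n", "07/01/2016, 6:14 pm - abcde\n", "\n", "fghe\n"]
print(list(iter_records(sample)))
```

Lines that appear before the first date are ignored here rather than causing an IndexError; whether that is the right behavior depends on the file.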
Simple length testing:
#!python3
#coding=utf-8
data = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde"""
lines = data.split("\n")
out = []
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c
        else:
            out.append(c)
    #skip empty
for o in out:
    print(o)
leads to:
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcdefghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcdefgheijkl
07/01/2016, 7:58 pm - abcde
Note that this output no longer contains the line breaks within the data!
But this one-line regex should do it (splitting at every line break that is followed by a digit), at least for the sample data. It fails when a continuation line inside a record starts with a digit:
#!python3
#coding=utf-8
text_file = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde"""
import re
data = re.split(r"\n(?=\d)", text_file)  # raw string avoids an invalid-escape warning
print(data)
for d in data:
    print(d)
Output:
['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde
(fixed using a lookahead)
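The digit-only lookahead splits too early whenever a continuation line begins with a digit. One possible tightening, sketched below, is to require a full date followed by a comma in the lookahead. This assumes every record starts with dd/mm/yyyy and a comma; the sample text is made up to trigger the failure case.

```python
import re

text = (
    "07/01/2016, 6:14 pm - abcde\n"
    "2 fghe\n"
    "07/01/2016, 6:20 pm - abcde"
)

# Split only at line breaks that are followed by a full date and a comma.
strict = re.split(r"\n(?=\d{2}/\d{2}/\d{4},)", text)
# For comparison: the digit-only lookahead also splits before "2 fghe".
loose = re.split(r"\n(?=\d)", text)

print(strict)
print(loose)
```

With the stricter lookahead, "2 fghe" stays attached to the 6:14 pm record, while the digit-only version produces three pieces.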