Parsing an unstructured text file with Python
I have a text file, several snippets of which look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets@yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(should be 2 spaces here.. this is not in normal text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia@yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation@yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama@cylon.net
Here's my code:
import re
file = open('file-path.txt','r')
# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
line = line.rstrip('\n').strip()
if (line == ''):
if prev_blank == True:
# end of the entry, create append the entry
if(len(curr) > 0):
entries.append(curr)
print curr
curr = []
prev_blank = False
else:
prev_blank = True
else:
prev_blank = False
# we need to parse the line
line_list = line.split()
str = ''
start = False
for item in line_list:
if re.match('[a-zA-Z\s]+:.*',item):
if len(str) > 0:
curr.append(str)
str = item
start = True
elif start == True:
str = str + ' ' + item
Here's the result:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
My problems are as follows:
- First, there should be 2 entries as output and I am only outputting them.
- In the top block of text, my script doesn't know where the previous value ends and a new one starts: "Status: Fell Thru" should be one value "Primary Closing Side:", "Buyer" Acceptance: 01/10/2010 ',' Closing Date : 01/10/2010 ',' Property Type: RRR ',' MLS Number: ',' Sale Price: $ 299,000 ',' Commission: 33.00% 'are to be caught.
- After correct analysis, I will need to parse again to separate keys from values ββ(eg Closed Date: 01/10/2010), ideally in a dicts list.
I can't think of a better way other than using regular expressions to select keys and then grab the chunks of text that follow.
When done, I need a key-filled csv w / header line that I can import into pandas w / read_csv. I've spent quite a few hours on this.
source to share
(This is not a complete answer, but it is too long for a comment.)
- Field names can have spaces (for example
MLS number
) - Multiple fields may appear on each line (for example
Home: Pager:
) - The Notes box has time in it, with
:
in it
This means that you cannot use your own approach to determine field names from a regular expression. It is not possible for it to know if "MLS" is part of a previous data value or a subsequent field name.
Some of the lines Home: Pager:
refer to the Seller, some to the Buyer, Tenant's Agent or Leasing Agent. This means that the naive step-by-step approach I take below doesn't work either.
This is the code I was working on, it works against your test data, but gives the wrong output due to the above. Here's for a link to the approach I was taking:
replaces = [
('Closing Report for', 'Report_for:')
,('Report for', 'Report_for:')
,('File number', 'File_number')
,('Primary closing party', 'Primary_closing_party')
,('MLS number', 'MLS_number')
,('Sale Price', 'Sale_Price')
,('Account owner', 'Account_owner')
# ...
# etc.
]
def fix_linemash(data):
# splits many fields on one line into several lines
results = []
mini_collection = []
for token in data.split(' '):
if ':' not in token:
mini_collection.append(token)
else:
results.append(' '.join(mini_collection))
mini_collection = [token]
return [line for line in results if line]
def process_record(data):
# takes a collection of lines
# fixes them, and builds a record dict
record = {}
for old, new in replaces:
data = data.replace(old, new)
for line in fix_linemash(data):
print line
name, value = line.split(':', 1)
record[name.strip()] = value.strip()
return record
records = []
collection = []
blank_flag = False
for line in open('d:/lol.txt'):
# Read through the file collecting lines and
# looking for double blank lines
# every pair of blank lines, process the stored ones and reset
line = line.strip()
if line.startswith('Page '): continue
if line.startswith('Printed on '): continue
if not line and blank_flag: # record finished
records.append( process_record(' '.join(collection)) )
blank_flag = False
collection = []
elif not line: # maybe end of record?
blank_flag = True
else: # false alarm, record continues
blank_flag = False
collection.append(line)
for record in records:
print record
Now I think it would be much better to do some tidyup preprocessing steps on the data:
- Separate the lines "Page n of n" and "Printed on ..." and similar
- Identify all valid field names and then split the combined strings, that is, each row has only one field, the fields start at the beginning of the row.
- Run and process the Seller / Buyer / Agents blocks, replacing the field names with an identification prefix, for example.
Email:
βSeller Email:
.
Then write a parser for records, which should be simple - check for two blank lines, split the lines at the first colon, use the left bit as the field name and the right bit as the value. Save however you want (nb. That the dictionary keys are unordered).
source to share
I find it easier to start a new entry by clicking on the word "Page".
Just share your experience - it's too hard to write a generic parser.
The situation is not so bad given the data here. Instead of using a simple list to store the entry, use an object. Add all objects as attributes / values ββto the object.
source to share