Analyzing text with Python 2.7
Text file
• I.D.: AN000015544
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398
• I.D.: AN000016955
DESCRIPTION: TEMPERATURE CALIBRATOR
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063
• I.D.: AN000017259
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076
• I.D.: AN000032766
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY5048 9036
Purpose
I am looking for a more efficient way to parse the manufacturer's name and model number, i.e. 'HEWLETT-PACKARDMODEL NUM.: 34401A', 'AGILENT MODEL NUM.: U1253B', etc., from the text file above.
Data structure
parts_data = {'Model_Number': []}
Code
with open("textfile", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        model_number = ''
        model_name = ''
        # The second ':'-separated field holds the manufacturer name.
        if "MANUFACTURER:" in line:
            model_name = line.split(':')[1]
        # The third field starts with the model number.
        if "NUM.:" in line:
            model_number = line.split(':')[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())
My code does exactly what I want, but I think there is a faster or cleaner way to do this. How can I make it more efficient?
Your code looks great, and unless you are parsing gigabytes of data, I don't see what the problem is. Still, I thought of a few things.
If you remove the line linearray = parts_info.readlines() and loop over the open file directly (for line in parts_info:), Python reads the file one line at a time, so the whole thing is streamed in case your file is huge. Currently, that line reads the entire file into memory at once rather than going line by line, so you can run out of memory if the file is larger than your RAM.
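A minimal sketch of the streamed version, assuming the line format from the question. The helper name iter_model_lines is made up for illustration, and io.StringIO only stands in for the real file so the snippet runs on its own; an open file object can be passed in the same way.

```python
import io

def iter_model_lines(lines):
    """Yield only the lines that carry a model number, one at a time."""
    for line in lines:  # any iterable of lines works, including an open file
        if "NUM.:" in line:
            yield line.rstrip("\n")

# Stand-in for open("textfile"); the u prefix keeps this valid on Python 2.7.
sample = io.StringIO(
    u"DESCRIPTION: TEMPERATURE CALIBRATOR\n"
    u"MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063\n"
)
matching = list(iter_model_lines(sample))
```

With the real file, the loop would be for line in iter_model_lines(parts_info): inside the with open("textfile") block, and no more than one line needs to be held at a time.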
You can also combine the if statements into one conditional, as you only seem to care about lines that have both fields. In the interest of cleaner code, you then also don't need model_number = ''; model_name = ''.
Saving the results of things like line.split(':') can help.
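Roughly, the two suggestions above could look like this. It is only a sketch: the function name extract_models is hypothetical, and it assumes every manufacturer line has the name in the second ':'-separated field and the model number at the start of the third, as in the sample data. It keeps the output of the original code, leading space included.

```python
def extract_models(lines):
    """Collect 'name number' strings from lines that have both fields."""
    models = []
    for line in lines:
        # One conditional: only lines with both markers are of interest.
        if "MANUFACTURER:" in line and "NUM.:" in line:
            fields = line.split(':')  # split once, reuse the result below
            model_name = fields[1]
            model_number = fields[2].split()[0]
            models.append((model_name + ' ' + model_number).rstrip())
    return models

sample = [
    "DESCRIPTION: TRUE RMS MULTIMETER\n",
    "MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076\n",
]
parts_data = {'Model_Number': extract_models(sample)}
```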
Alternatively, you can try a regular expression. There is no way to tell which one would perform better without testing both, which brings me back to what I said at the beginning: code optimization is hard and shouldn't really be done unless it's necessary. If you really cared about efficiency, you would use a program like awk, which is written in C.
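One way to test both is the standard timeit module. The sketch below times each approach on a single sample line; the pattern is only an illustrative guess at the line format, the function names are made up, and the numbers will vary by machine, so no timings are quoted here.

```python
import re
import timeit

LINE = ("MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION - "
        "DUE DATE:6/1/2016 SERIAL NUMBER: MY5048 9036")
# Illustrative pattern, compiled once so compilation is not timed.
PATTERN = re.compile(r'[A-Z ]+ NUM\.: [A-Z\d]+')

def with_split(line=LINE):
    fields = line.split(':')
    return fields[1] + ' ' + fields[2].split()[0]

def with_regex(line=LINE):
    match = PATTERN.search(line)
    return match.group(0) if match else None

# Run each function 10000 times and report the total time in seconds.
split_seconds = timeit.timeit(with_split, number=10000)
regex_seconds = timeit.timeit(with_regex, number=10000)
print("split: %.4f s, regex: %.4f s" % (split_seconds, regex_seconds))
```

For a fair comparison, it would be better to time a full pass over a realistic file rather than one line.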
One direct way is using regex:
import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        # Uppercase words and spaces, then " NUM.: ", then the model number.
        m = re.search(r'[A-Z ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print(m.group(0))
result:
'PACKARDMODEL NUM.: 34401A',
' FLUKE MODEL NUM.: 724',
' AGILENT MODEL NUM.: U1253A',
' AGILENT MODEL NUM.: U1253B'
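Note that the first result lost "HEWLETT-" because the hyphen is not in the character class [A-Z ]. A possible variant, offered as a sketch rather than a definitive pattern: it assumes manufacturer names contain only uppercase letters, digits, spaces, and hyphens, anchors on "MANUFACTURER:", and captures the name and number as separate groups. The names MODEL_RE and parse_model are made up for illustration.

```python
import re

# Assumed line shape: "MANUFACTURER: <name> NUM.: <number> ..."
MODEL_RE = re.compile(r'MANUFACTURER:\s*([A-Z\d -]+?)\s*NUM\.:\s*(\S+)')

def parse_model(line):
    """Return (name, number) for a manufacturer line, or None otherwise."""
    match = MODEL_RE.search(line)
    return match.groups() if match else None

hp = parse_model("MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A "
                 "CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398")
fluke = parse_model("MANUFACTURER: FLUKE MODEL NUM.: 724 "
                    "CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063")
# hp    -> ('HEWLETT-PACKARDMODEL', '34401A')
# fluke -> ('FLUKE MODEL', '724')
```

Capturing groups also strips the leading space seen in the results above, since the surrounding whitespace is matched outside the groups.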
Several things come to my mind:
- You can do the split(':') once and reuse the result.
- If the number of ':' characters is always the same, then discard the ifs and check the length once.
I end up with something like this:
parts_data = {'Model_Number': []}
with open("textfile.txt", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        linesp = line.split(':')
        if len(linesp)>2:
            model_name = linesp[1]
            model_number = linesp[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())