Analyzing text with Python 2.7

Text file

• I.D.: AN000015544 
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER 
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398 
• I.D.: AN000016955 
DESCRIPTION: TEMPERATURE CALIBRATOR 
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063 
• I.D.: AN000017259 
DESCRIPTION: TRUE RMS MULTIMETER 
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076 
• I.D.: AN000032766                         
DESCRIPTION: TRUE RMS MULTIMETER                            
MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION    -   DUE DATE:6/1/2016   SERIAL  NUMBER: MY5048  9036

      

Purpose

I am looking for a more efficient way to parse the manufacturer's name and model number, i.e. 'HEWLETT-PACKARDMODEL NUM.: 34401A', 'AGILENT MODEL NUM.: U1253B', etc., from the text file above.

Data structure

parts_data = {'Model_Number': []}

      

Code

with open("textfile", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        model_number = ''
        model_name = ''
        if "MANUFACTURER:" in line:
            model_name = line.split(':')[1]
        if "NUM.:" in line:
            model_number = line.split(':')[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())

      

My code does exactly what I want, but I suspect there is a faster or cleaner way to do it. How can I make it more efficient?



3 answers


Your code looks fine, and unless you are parsing more than a gigabyte of data, I don't see a real problem. A few things did come to mind, though.

If you remove the line linearray = parts_info.readlines() and loop over the open file directly (for line in parts_info:), Python streams the file one line at a time, which matters if your file is huge. Currently, that line reads the entire file into memory at once rather than going line by line, so a file larger than your available memory can exhaust it.
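To make the streaming idea concrete, here is a minimal sketch. The function name manufacturer_lines and the throwaway demo file are my own additions for illustration, not part of the question.

```python
import os
import tempfile

def manufacturer_lines(path):
    """Lazily yield the lines of `path` that contain 'MANUFACTURER:'."""
    with open(path, 'r') as parts_info:
        for line in parts_info:  # the file object yields one line at a time
            if "MANUFACTURER:" in line:
                yield line.rstrip('\n')

# Build a small throwaway file so the sketch can run on its own.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as out:
    out.write("DESCRIPTION: TEMPERATURE CALIBRATOR\n")
    out.write("MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - "
              "DUE DATE:6/1/2016 SERIAL NUMBER: 1189063\n")

found = list(manufacturer_lines(demo_path))
os.remove(demo_path)
```

Only one line is held in memory at a time inside the loop; the list() call at the end is there just to inspect the result of the demo.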

You can also combine the two if statements into a single conditional, since you only seem to care about lines that have both fields. In the interest of cleaner code, you then also don't need the initializations model_number = '' and model_name = ''.

Saving the result of calls like line.split(':') instead of repeating them can help as well.
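Put together, those suggestions might look roughly like this. It is a sketch: the helper name extract_models and the two sample lines are assumptions of mine, and the function accepts any iterable of lines (an open file works too).

```python
def extract_models(lines):
    """Return 'name number' strings from an iterable of text lines."""
    models = []
    for line in lines:
        # One conditional: only lines carrying both fields are of interest.
        if "MANUFACTURER:" in line and "NUM.:" in line:
            fields = line.split(':')  # split once, reuse the pieces below
            model_name = fields[1]
            model_number = fields[2].split()[0]
            models.append((model_name + ' ' + model_number).rstrip())
    return models

# Two sample lines copied from the question's text file.
sample = [
    "DESCRIPTION: TEMPERATURE CALIBRATOR \n",
    "MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - "
    "DUE DATE:6/1/2016 SERIAL NUMBER: 1189063 \n",
]
parts_data = {'Model_Number': extract_models(sample)}
```

As in the original code, only trailing whitespace is stripped, so each entry keeps the leading space that follows "MANUFACTURER:".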

Alternatively, you can try a regular expression. There is no way to tell which one would perform better without testing both, which brings me back to what I said at the beginning: code optimization is hard and shouldn't really be done unless it's necessary. If you really cared about efficiency, you would use a program like awk, which is written in C.
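If you do want to test both, the standard timeit module is one way to do it. The sketch below is only illustrative: the pattern, the function names, and the repetition count are my own choices, and the timings will vary by machine and Python version.

```python
import re
import timeit

LINE = ("MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - "
        "DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076")

# Assumed pattern: text up to 'NUM.' after the label, then the model number.
PATTERN = re.compile(r'MANUFACTURER:(.*?NUM\.):\s*(\S+)')

def with_split(line):
    fields = line.split(':')
    return fields[1] + ' ' + fields[2].split()[0]

def with_regex(line):
    match = PATTERN.search(line)
    return match.group(1) + ' ' + match.group(2)

# Time each approach on the same line; compare the two numbers yourself.
split_seconds = timeit.timeit(lambda: with_split(LINE), number=10000)
regex_seconds = timeit.timeit(lambda: with_regex(LINE), number=10000)
print("split: %.4fs  regex: %.4fs" % (split_seconds, regex_seconds))
```

Both functions return the same string for this line, so the comparison is like for like. Measure on your real file before drawing conclusions.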



One direct way is using regex:

import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        # Uppercase letters/spaces, then ' NUM.: ', then the model number
        m = re.search(r'[A-Z ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print m.group(0)

      



Result:

'PACKARDMODEL NUM.: 34401A', 
' FLUKE MODEL NUM.: 724', 
' AGILENT MODEL NUM.: U1253A', 
' AGILENT MODEL NUM.: U1253B'
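Note that the first match is cut down to 'PACKARDMODEL' because the character class [A-Z ] does not include the hyphen in HEWLETT-PACKARD. If separate name and number fields are wanted, a variant with named groups could look like the sketch below. The pattern, the group names, and the helper parse_model are assumptions of mine, not something the question asks for.

```python
import re

# Assumed pattern: the manufacturer may contain letters, spaces, and hyphens,
# and may run directly into 'MODEL NUM.:' with no space in between.
MODEL_RE = re.compile(
    r'MANUFACTURER:\s*(?P<name>[A-Z -]+?)\s*MODEL NUM\.:\s*(?P<number>[A-Z\d]+)'
)

def parse_model(line):
    """Return (manufacturer, model_number), or None if the line has neither."""
    m = MODEL_RE.search(line)
    if m is None:
        return None
    return m.group('name'), m.group('number')

# Sample lines copied from the question's text file.
sample = [
    "MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - "
    "DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398",
    "MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - "
    "DUE DATE:6/1/2016 SERIAL NUMBER: 1189063",
]
results = [parse_model(line) for line in sample]
# results == [('HEWLETT-PACKARD', '34401A'), ('FLUKE', '724')]
```

The lazy quantifier +? stops the name as soon as "MODEL NUM.:" can follow, which is what separates HEWLETT-PACKARD from the fused MODEL.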

      



Several things come to my mind:

  • You can call split(':') once and reuse the result.
  • If the number of ':' characters on those lines is always the same, you can drop the ifs and check the length of the split result once.

I end up with something like this:

parts_data = {'Model_Number': []}
with open("textfile.txt", 'r') as parts_info:
    linearray = parts_info.readlines()

for line in linearray:
    linesp = line.split(':')
    if len(linesp)>2:
        model_name = linesp[1]
        model_number = linesp[2]
        model_number = model_number.split()[0]
        model_number = model_name + ' ' + model_number
        parts_data['Model_Number'].append(model_number.rstrip())

      
