How can I improve the speed of this readline loop in Python?

I am importing several pieces of a database dump in text format into MySQL; the problem is that there is a lot of uninteresting stuff in front of the data I actually want. I wrote this loop to get to the required data:

def readloop(DBFILE):
    txtdb = open(DBFILE, 'r')

    sline = ""

    # loop till 1st "customernum:" is found
    while sline.startswith("customernum:  ") is False:
        sline = txtdb.readline()

    while sline.startswith("customernum:  "):
        data = []
        data.append(sline)
        sline = txtdb.readline()
        while sline.startswith("customernum:  ") is False:
            data.append(sline)
            sline = txtdb.readline()
            if len(sline) == 0:
                break
        customernum = getitem(data, "customernum:  ")
        street = getitem(data, "street:  ")
        country = getitem(data, "country:  ")
        zip = getitem(data, "zip:  ")


The text file is quite large, so just looping until the first wanted entry takes a long time. Does anyone have an idea whether this could be done faster (or whether the whole way I am doing this is not the best idea)?

Thank you very much in advance!



5 answers


The general idea of the optimization is to proceed "by big blocks" (mostly ignoring the structure of the lines) to locate the first line of interest, then move on to by-line processing for the rest. It is somewhat finicky and error-prone (off-by-one issues and the like), so it really needs testing, but the general idea is this:

import itertools

def readloop(DBFILE):
  txtdb = open(DBFILE, 'r')
  tag = "customernum:  "
  BIGBLOCK = 1024 * 1024
  # locate first occurrence of tag at line-start
  # (assumes the VERY FIRST line doesn't start that way,
  # else you need a special-case and slight refactoring)
  blob = ''
  while True:
    chunk = txtdb.read(BIGBLOCK)
    if not chunk:
      # hit EOF: tag not present at all -- warn about that, then
      return
    blob = blob + chunk
    where = blob.find('\n' + tag)
    if where != -1:  # found it!
      blob = blob[where+1:] + txtdb.readline()
      break
    # keep only a short tail in case the tag straddles two blocks
    blob = blob[-len(tag):]
  # now make a by-line iterator over the part of interest
  thelines = itertools.chain(blob.splitlines(1), txtdb)
  sline = next(thelines, '')
  while sline.startswith(tag):
    data = []
    data.append(sline)
    sline = next(thelines, '')
    while not sline.startswith(tag):
      data.append(sline)
      sline = next(thelines, '')
      if not sline:
        break
    customernum = getitem(data, "customernum:  ")
    street = getitem(data, "street:  ")
    country = getitem(data, "country:  ")
    zip = getitem(data, "zip:  ")




Here I have tried to keep as much of your structure as possible, with only minor improvements beyond the "big idea" of this refactoring.



Please don't write this code:

while condition is False:


Boolean conditions are boolean, for crying out loud, so they can be tested (or negated and tested) directly:

while not condition:


Your second while loop is not written as "while condition is True:", so I am curious why you felt the need to test for "False" in the first one.

Pulling out the dis module, I thought I would analyze this a little further. In my pyparsing experience, function calls are common performance killers, so it would be nice to avoid them where possible. Here is your original test:

>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_GLOBAL              1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE


Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You can shave off the LOAD_GLOBAL by defining a local name for False:

>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_FAST                1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE


But what if we just drop the "is" test entirely?

>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 UNARY_NOT
             13 RETURN_VALUE




We have collapsed a LOAD_xxx and a COMPARE_OP into a simple UNARY_NOT. The "is False" test definitely does not help performance at all.

Now, what if we could do some gross elimination of a line using no function calls at all? If the first character of the line is not "c", there is no way it can start with "customernum". Let's try that:

>>> test = lambda t : t[0] != 'c' and not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_CONST               0 (0)
              6 BINARY_SUBSCR
              7 LOAD_CONST               1 ('c')
             10 COMPARE_OP               3 (!=)
             13 JUMP_IF_FALSE           14 (to 30)
             16 POP_TOP
             17 LOAD_FAST                0 (t)
             20 LOAD_ATTR                0 (startswith)
             23 LOAD_CONST               2 ('customernum')
             26 CALL_FUNCTION            1
             29 UNARY_NOT
        >>   30 RETURN_VALUE


(Note that using [0] to get the first character of a string does not create a slice; indexing is in fact very fast.)

Now, assuming there are not many lines starting with "c", this coarse-cut filter can eliminate a line using only fairly quick instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'", we save ourselves an extraneous UNARY_NOT instruction.

So, applying what we learned from this short-cut optimization study, I suggest changing this code:

while sline.startswith("customernum:  ") is False:
    sline = txtdb.readline()

while sline.startswith("customernum:  "):
    ... do the rest of the customer data stuff...


For this:

for sline in txtdb:
    if sline[0] == 'c' and \
       sline.startswith("customernum:  "):
        ... do the rest of the customer data stuff...


Note that I also removed the .readline() function call, and simply iterate over the file using "for sline in txtdb".
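
To check what the coarse filter actually buys you on your own data, a quick micro-benchmark is easy to put together. This is just a sketch using the standard timeit module; the sample lines and repetition count are made up:

import timeit

# Made-up sample: 99 non-matching lines plus one matching line.
setup = (
    "lines = ['street:  Some Road 1\\n'] * 99 + ['customernum:  12345\\n']\n"
    "tag = 'customernum:  '"
)

plain  = "for t in lines:\n    t.startswith(tag)"
coarse = "for t in lines:\n    t[0] == 'c' and t.startswith(tag)"

print 'startswith only:', timeit.Timer(plain,  setup).timeit(number=10000)
print 'coarse filter  :', timeit.Timer(coarse, setup).timeit(number=10000)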

I realize Alex has provided an entirely different body of code for finding that first "customernum:" line, but I would try optimizing within the general framework of your algorithm before pulling out the big but obscure block-reading guns.



I am guessing you are writing this import script and it gets boring to wait while testing it, so the data stays the same all the time.

You can run the script once to determine the actual positions in the file that you want to jump to, with print txtdb.tell(). Write those down and replace the searching code with txtdb.seek(pos). Basically, that is building an index for the file ;-)

Another, more conventional way would be to read the data in large chunks, several MB at a time, not just the few bytes of a single line.
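
For illustration, a minimal sketch of that indexing idea; the file name and the two-pass structure are invented for the example and would need adapting:

# Pass 1 (run once): record the byte offset of every record start.
positions = []
with open('dbdump.txt', 'r') as txtdb:      # hypothetical dump file
    while True:
        pos = txtdb.tell()                  # offset of the line about to be read
        line = txtdb.readline()
        if not line:
            break
        if line.startswith('customernum:  '):
            positions.append(pos)

# Later test runs: skip the junk and jump straight to the first record.
with open('dbdump.txt', 'r') as txtdb:
    txtdb.seek(positions[0])
    sline = txtdb.readline()                # first "customernum:  " line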





Tell us more about the file.

Can you use file.seek to do a binary search? Seek to the halfway point, read a few lines, determine whether you are before or after the part you need, and recurse. That will turn your O(n) search into O(log n).
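
A rough sketch of that idea, assuming the dump really is one block of header junk followed by one contiguous block of records. The function name, the probe window, and the edge-case handling are all invented here and would need testing against real data:

def find_data_start(path, tag='customernum:  ', probe_lines=50):
    # Assumes any probe_lines-long window inside the record section
    # contains at least one line starting with `tag`. The probe is needed
    # because a seek can land mid-record, where the next line is a field
    # line rather than a "customernum:" line.
    with open(path, 'r') as f:
        f.seek(0, 2)                       # whence=2: jump to end of file
        lo, hi = 0, f.tell()               # candidate byte range for the boundary
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                   # discard the (likely partial) line we landed in
            in_data = False
            for _ in range(probe_lines):
                line = f.readline()
                if not line:
                    break
                if line.startswith(tag):
                    in_data = True
                    break
            if in_data:
                hi = mid                   # records start at or before mid
            else:
                lo = mid + 1               # still inside the header junk
        # lo is now close to the boundary; scan forward to the exact line.
        if lo:
            f.seek(lo)
            f.readline()
        else:
            f.seek(0)
        while True:
            pos = f.tell()
            line = f.readline()
            if not line or line.startswith(tag):
                return pos

The returned value is a byte offset, so the caller would follow up with f.seek(pos) and then run the normal by-line loop from there.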


