How can I improve the speed of this readline loop in python?
I am importing several pieces of a database dump (in text format) into MySQL; the problem is that there is a lot of uninteresting stuff in front of the data I actually want. I wrote this loop to get the required data:
    def readloop(DBFILE):
        txtdb = open(DBFILE, 'r')
        sline = ""
        # loop till 1st "customernum:" is found
        while sline.startswith("customernum: ") is False:
            sline = txtdb.readline()

        while sline.startswith("customernum: "):
            data = []
            data.append(sline)
            sline = txtdb.readline()
            while sline.startswith("customernum: ") is False:
                data.append(sline)
                sline = txtdb.readline()
                if len(sline) == 0:
                    break
            customernum = getitem(data, "customernum: ")
            street = getitem(data, "street: ")
            country = getitem(data, "country: ")
            zip = getitem(data, "zip: ")
The text file is quite large, so just looping until the first wanted entry takes a long time. Does anyone have an idea whether this could be done faster (or whether my whole approach is not the best idea)?
Thank you very much in advance!
The general idea for optimization is to proceed "by big blocks" (mostly ignoring line structure) to locate the first line of interest, then switch to by-line processing for the rest. It's somewhat finicky and error-prone (off-by-one issues and the like), so it really needs testing, but the general idea is this:
    import itertools

    def readloop(DBFILE):
        txtdb = open(DBFILE, 'r')
        tag = "customernum: "
        BIGBLOCK = 1024 * 1024
        # locate first occurrence of tag at line-start
        # (assumes the VERY FIRST line doesn't start that way,
        # else you need a special-case and slight refactoring)
        blob = ''
        while True:
            chunk = txtdb.read(BIGBLOCK)
            if not chunk:
                # tag not present at all -- warn about that, then
                return
            blob = blob + chunk
            where = blob.find('\n' + tag)
            if where != -1:  # found it!
                blob = blob[where+1:] + txtdb.readline()
                break
            # keep a short tail in case the tag straddles two blocks
            blob = blob[-len(tag):]
        # now make a by-line iterator over the part of interest
        thelines = itertools.chain(blob.splitlines(True), txtdb)
        sline = next(thelines, '')
        while sline.startswith(tag):
            data = []
            data.append(sline)
            sline = next(thelines, '')
            while not sline.startswith(tag):
                data.append(sline)
                sline = next(thelines, '')
                if not sline:
                    break
            customernum = getitem(data, "customernum: ")
            street = getitem(data, "street: ")
            country = getitem(data, "country: ")
            zip = getitem(data, "zip: ")
Here I have tried to preserve as much of your structure as possible, with only minor improvements beyond the "big idea" of this refactoring.
Please don't write this code:

    while condition is False:

Boolean conditions are boolean, for crying out loud, so they can be tested (or negated and tested) directly:

    while not condition:

You don't write your second while loop as "while condition is True:", so I'm curious why you felt the need to test for "False" in the first one.
Pulling out the dis module, I thought I would dissect this a bit. In my pyparsing experience, function calls are total performance killers, so it would be nice to avoid function calls if possible. Here is your original test:
>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_ATTR 0 (startswith)
6 LOAD_CONST 0 ('customernum')
9 CALL_FUNCTION 1
12 LOAD_GLOBAL 1 (False)
15 COMPARE_OP 8 (is)
18 RETURN_VALUE
Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You could shave off the LOAD_GLOBAL by defining a local name for False:
>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_ATTR 0 (startswith)
6 LOAD_CONST 0 ('customernum')
9 CALL_FUNCTION 1
12 LOAD_FAST 1 (False)
15 COMPARE_OP 8 (is)
18 RETURN_VALUE
But what if we just drop the 'is' test entirely?:
>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_ATTR 0 (startswith)
6 LOAD_CONST 0 ('customernum')
9 CALL_FUNCTION 1
12 UNARY_NOT
13 RETURN_VALUE
We have collapsed a LOAD_xxx and a COMPARE_OP into a simple UNARY_NOT. The "is False" definitely does not help the performance cause at all.
Now what if we could do some wholesale rejection of lines with no function calls at all? If the first character of the line is not a 'c', there is no way it can start with "customernum". Let's try that:
>>> test = lambda t : t[0] != 'c' and not t.startswith('customernum')
>>> dis.dis(test)
1 0 LOAD_FAST 0 (t)
3 LOAD_CONST 0 (0)
6 BINARY_SUBSCR
7 LOAD_CONST 1 ('c')
10 COMPARE_OP 3 (!=)
13 JUMP_IF_FALSE 14 (to 30)
16 POP_TOP
17 LOAD_FAST 0 (t)
20 LOAD_ATTR 0 (startswith)
23 LOAD_CONST 2 ('customernum')
26 CALL_FUNCTION 1
29 UNARY_NOT
>> 30 RETURN_VALUE
(Note that using [0] to get the first character of the string does not create a slice — it is actually very fast.)
Now, assuming there are not many lines starting with 'c', the rough-cut filter can eliminate a line using all fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'", we save ourselves an extraneous UNARY_NOT instruction.
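To put a rough number on the idea, here is a small micro-benchmark sketch using the stdlib timeit module. The sample line and iteration count are arbitrary assumptions; absolute numbers vary by machine and Python version, and on a modern CPython the gap is smaller than on the interpreters this answer was written against.

```python
import timeit

# hypothetical non-matching line, like most lines in the dump
line = "street: Main St 1"

# plain test: always calls the startswith method
plain = timeit.timeit(
    lambda: line.startswith("customernum: "), number=100_000)

# guarded test: the cheap first-character check short-circuits the
# method call whenever the line does not begin with 'c'
guarded = timeit.timeit(
    lambda: line[0] == 'c' and line.startswith("customernum: "),
    number=100_000)

print(plain, guarded)
```

Both timings are wall-clock floats; run it a few times and compare, rather than trusting a single measurement.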
So, applying this short-circuit optimization, I suggest changing this code:

    while sline.startswith("customernum: ") is False:
        sline = txtdb.readline()
    while sline.startswith("customernum: "):
        ... do the rest of the customer data stuff ...

to this:

    for sline in txtdb:
        if sline[0] == 'c' and \
           sline.startswith("customernum: "):
            ... do the rest of the customer data stuff ...
Note that I have also removed the .readline() function call, and simply iterate over the file with "for sline in txtdb".
I realize Alex provided a different body of code entirely for finding that first "customernum" line, but I would try optimizing within the general framework of your algorithm before pulling out the big-but-obscure block-reading guns.
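Putting the pieces together, here is one runnable sketch of that filtered loop. The sample text, the record-collection logic, and the getitem() helper are all assumptions for illustration (getitem() stands in for the asker's helper of the same name):

```python
import io

# hypothetical sample resembling the dump format described in the question
SAMPLE = """header junk
more junk
customernum: 101
street: Main St 1
country: DE
zip: 12345
customernum: 102
street: Side St 2
country: AT
zip: 54321
"""

def getitem(data, tag):
    # return the value after `tag` in the first matching line of a record
    for line in data:
        if line.startswith(tag):
            return line[len(tag):].strip()
    return None

def read_customers(txtdb):
    records = []
    data = None
    for sline in txtdb:
        if sline[0] == 'c' and sline.startswith("customernum: "):
            if data is not None:
                records.append(data)   # close out the previous record
            data = [sline]             # start a new record
        elif data is not None:
            data.append(sline)         # accumulate lines for current record
    if data is not None:
        records.append(data)           # don't lose the last record
    return records

records = read_customers(io.StringIO(SAMPLE))
print(len(records))                    # 2
print(getitem(records[0], "zip: "))    # 12345
```

The for-loop does one cheap character test per line and only pays for startswith() on lines that begin with 'c'.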
I guess you are writing this import script and it gets boring to wait through the whole scan while testing it, while the data stays the same the whole time.
You could run the script once to detect the actual positions in the file you want to jump to, with print txtdb.tell(). Write those down, and replace the searching code with txtdb.seek(pos). Basically that builds an index for the file ;-)
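A hedged sketch of that "index the file once" idea: record f.tell() at each record start on a first pass, then seek straight there on later runs. The sample bytes and tag are assumptions; a real dump would be opened from disk in binary mode so that tell()/seek() offsets are reliable.

```python
import io

# hypothetical miniature dump
SAMPLE = b"junk line\ncustomernum: 1\nstreet: A\ncustomernum: 2\nstreet: B\n"

def build_index(f, tag=b"customernum: "):
    # one pass: remember the byte offset of every line that starts a record
    index = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith(tag):
            index.append(pos)
    return index

f = io.BytesIO(SAMPLE)
index = build_index(f)

f.seek(index[1])          # jump straight to the second record
print(f.readline())       # b'customernum: 2\n'
```

On later runs you skip the scan entirely and seek() to the saved offsets, as long as the file has not changed.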
Another, more conventional way would be to read the data in larger chunks, several MB at a time, not just the few bytes of a single line.
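A minimal sketch of such chunked reading, assuming an arbitrary chunk size: iter() with a sentinel keeps calling f.read(CHUNK) until it returns "" at end-of-file, so the file is consumed a megabyte at a time instead of line by line.

```python
import io
from functools import partial

CHUNK = 1024 * 1024                      # 1 MiB per read; tune to taste

# stand-in for open(DBFILE): three million characters of dummy data
f = io.StringIO("x" * 3_000_000)

total = 0
for chunk in iter(partial(f.read, CHUNK), ""):
    total += len(chunk)                  # scan/parse the chunk here

print(total)                             # 3000000
```

Each chunk can then be scanned with fast string methods like find(), as in Alex's answer, rather than paying per-line readline() overhead.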