Improving Python Code Performance
How can I improve the performance of this simple piece of Python code? Is re.search not the best way to find a matching string, given that it is almost 6x slower than Perl, or am I doing something wrong?
#!/usr/bin/env python
import re
import time
import sys
i=0
j=0
time1=time.time()
base_register =r'DramBaseAddress\d+'
for line in open('rndcfg.cfg'):
    i+=1
    if(re.search(base_register, line)):
        j+=1
time2=time.time()
print (i,j)
print (time2-time1)
print (sys.version)
This code takes about 0.96 seconds (average of 10 runs)
Output:
168197 2688
0.8597519397735596
3.3.2 (default, Sep 24 2013, 15:14:17)
[GCC 4.1.1]
while the following Perl code does it in 0.15 seconds.
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time);
my $i=0;my $j=0;
my $time1=time;
open(my $fp, 'rndcfg.cfg');
while(<$fp>)
{
    $i++;
    if(/DramBaseAddress\d+/)
    {
        $j++;
    }
}
close($fp);
my $time2=time;
printf("%d,%d\n",$i,$j);
printf("%f\n",$time2-$time1);
printf("%s\n",$]);
Output:
168197,2688
0.135579
5.012001
EDIT: Fixed the regex, which degraded performance slightly
In fact, regex is less efficient than string methods in Python. From https://docs.python.org/2/howto/regex.html#use-string-methods :
Strings have several methods for performing operations with fixed strings and they're usually much faster, because the implementation is a single small C loop that's been optimized for the purpose, instead of the large, more generalized regular expression engine.
Replacing re.search with str.find will give you a better run time. Otherwise, the in operator that others have suggested is also well optimized.
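For example, the question's loop could be rewritten with a plain substring test (a sketch using the question's file name and counters; note that 'DramBaseAddress' in line does not require trailing digits, so it is a slightly looser match than the \d+ regex):
import time

i = 0
j = 0

time1 = time.time()
with open('rndcfg.cfg') as fp:
    for line in fp:
        i += 1
        # plain substring test; unlike the original regex, this does not
        # check that digits follow "DramBaseAddress"
        if 'DramBaseAddress' in line:
            j += 1
time2 = time.time()

print(i, j)
print(time2 - time1)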
As for the speed difference between the Python and Perl versions, I'll just chalk it up to how well each language handles this kind of task: text processing - see python vs perl performance
In this case, you are effectively searching for a fixed string, not a regular expression.
For plain strings there are faster methods:
>>> timeit.timeit('re.search(regexp, "banana")', setup = "import re; regexp=r'nan'")
1.2156920433044434
>>> timeit.timeit('"banana".index("nan")')
0.23752403259277344
>>> timeit.timeit('"banana".find("nan")')
0.2411658763885498
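(For comparison, a precompiled pattern can be timed the same way; a sketch along the same lines, making no claim about the exact numbers:)
>>> timeit.timeit('nan_search("banana")', setup = "import re; nan_search = re.compile(r'nan').search")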
This kind of text processing is Perl's sweet spot (aka the Practical Extraction and Reporting Language, aka the Pathologically Eclectic Rubbish Lister) and has been optimized over the years. All that collective focus adds up.
The call overhead of re.search, despite re.compile's caching, is significant. Use
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

for i, line in enumerate(open('rndcfg.cfg')):
    if is_wanted_line(line):
        j += 1
instead.
Alternatively, you can do
key = "DramBaseAddress"
is_wanted_line = re.compile(r"DramBaseAddress\d+").search
for i, line in enumerate(open('rndcfg.cfg')):
if key in line and is_wanted_line(line):
j += 1
to reduce overhead.
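Put together with the question's timing harness, a full version of this approach might look like the following (a sketch; the file name and counter names are taken from the question):
#!/usr/bin/env python
import re
import time

i = 0
j = 0

time1 = time.time()
key = "DramBaseAddress"
# Bind the compiled pattern's .search method once, so the loop avoids the
# per-call overhead of re.search; the cheap "key in line" test prefilters
# lines before the regex runs.
is_wanted_line = re.compile(r"DramBaseAddress\d+").search
with open('rndcfg.cfg') as fp:
    for line in fp:
        i += 1
        if key in line and is_wanted_line(line):
            j += 1
time2 = time.time()

print(i, j)
print(time2 - time1)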
You could also consider doing your own buffering:
key = b"DramBaseAddress"
is_wanted_line = re.compile(rb"DramBaseAddress\d+").search
with open("rndcfg.cfg", "rb") as file:
rest = b""
for chunk in iter(lambda: file.read(32768), b""):
i += chunk.count(b"\n")
chunk, _, rest = (rest + chunk).rpartition(b"\n")
if key in rest and is_wanted_line(chunk):
j += 1
if key in rest and is_wanted_line(rest):
j += 1
which removes the per-line splitting and decoding overhead. (This isn't quite equivalent, since it doesn't account for multiple matching lines per chunk. That behaviour is relatively easy to add, but it isn't strictly necessary in your case.)
It's a bit heavyweight, but it runs three times faster than the Perl version, and 8x faster if you remove i += chunk.count(b"\n")!