Improving Python Code Performance

How can I improve the performance of this simple piece of Python code? Is re.search not the best way to find a matching string, given that it is almost 6x slower than Perl, or am I doing something wrong?

#!/usr/bin/env python

import re
import time
import sys

i=0
j=0
time1=time.time()
base_register = r'DramBaseAddress\d+'
for line in open('rndcfg.cfg'):
    i+=1
    if(re.search(base_register, line)):
        j+=1
time2=time.time()

print (i,j)
print (time2-time1)    
print (sys.version)


This code takes about 0.96 seconds on average (over 10 runs).
Output:

168197 2688
0.8597519397735596
3.3.2 (default, Sep 24 2013, 15:14:17)
[GCC 4.1.1]


while the following Perl code does it in 0.15 seconds.

#!/usr/bin/env perl
use strict;
use warnings;

use Time::HiRes qw(time);

my $i=0;my $j=0;
my $time1=time;
open(my $fp, 'rndcfg.cfg');
while(<$fp>)
{
    $i++;
    if(/DramBaseAddress\d+/)
    {
        $j++;
    }
}
close($fp);
my $time2=time;

printf("%d,%d\n",$i,$j);
printf("%f\n",$time2-$time1);
printf("%s\n",$]);



Output:

168197,2688
0.135579
5.012001


EDIT: Fixed the regex, which degraded performance slightly.

+3




3 answers


In fact, regexes are less efficient than string methods in Python. From https://docs.python.org/2/howto/regex.html#use-string-methods :

Strings have several methods for performing operations with fixed strings and they're usually much faster, because the implementation is a single, small C loop that's been optimized for the purpose, instead of the large, more generalized regular expression engine.



Replacing re.search with str.find will give you a better runtime. Otherwise, using the in operator that others have suggested will also be well optimized.
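For example, here is a sketch of the question's loop rewritten with a fast fixed-string prefilter. One caveat (my assumption, not the answer's): the original pattern also requires trailing digits, so the regex is kept to confirm candidate lines rather than being dropped entirely.

import re

base_register = re.compile(r'DramBaseAddress\d+')

i = j = 0
with open('rndcfg.cfg') as fp:
    for line in fp:
        i += 1
        # str.find is a fast C substring scan; -1 means "not found"
        if line.find('DramBaseAddress') != -1 and base_register.search(line):
            j += 1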

As for the speed difference between the Python and Perl versions, I'll just attribute it to each language's inherent strengths at this kind of work: see Text processing - python vs perl performance.

+5




In this case, you are essentially searching for a fixed string, not a real regular expression.

For plain strings, there are faster methods:



>>> timeit.timeit('re.search(regexp, "banana")', setup="import re; regexp=r'nan'")
1.2156920433044434
>>> timeit.timeit('"banana".index("nan")')
0.23752403259277344
>>> timeit.timeit('"banana".find("nan")')
0.2411658763885498
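Much of that regex cost is per-call overhead rather than the match itself. For comparison, you can also time a precompiled pattern's bound search method (a sketch; output omitted, and absolute timings will vary by machine and Python version):

>>> timeit.timeit('search("banana")', setup="import re; search = re.compile(r'nan').search")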


This kind of text processing is Perl's sweet spot (aka Practical Extraction and Reporting Language, aka Pathologically Eclectic Rubbish Lister) and has been optimized over the years; all that accumulated focus is hard to beat.

+1




The call overhead of re.compile, despite the caching, is huge. Use

import re

# bind the compiled pattern's search method once, outside the loop
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

j = 0
for i, line in enumerate(open('rndcfg.cfg')):
    if is_wanted_line(line):
        j += 1


instead.

Alternatively, you can do

key = "DramBaseAddress"
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

for i, line in enumerate(open('rndcfg.cfg')):
    if key in line and is_wanted_line(line):
        j += 1


to reduce overhead.
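(The in test is a single fast C-level substring scan, so the regex engine only runs on the small fraction of lines that actually contain the literal key.)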

You can also consider doing your own buffering:

key = b"DramBaseAddress"
is_wanted_line = re.compile(rb"DramBaseAddress\d+").search

with open("rndcfg.cfg", "rb") as file:
    rest = b""

    for chunk in iter(lambda: file.read(32768), b""):
        i += chunk.count(b"\n")
        chunk, _, rest = (rest + chunk).rpartition(b"\n")

        if key in rest and is_wanted_line(chunk):
            j += 1

    if key in rest and is_wanted_line(rest):
        j += 1


which removes the overhead of line splitting and decoding. (This is not quite the same, as it doesn't allow for multiple matches per chunk; that behaviour is relatively easy to add, but not strictly necessary in your case.)
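A minimal sketch of that fix, counting every match in a chunk with findall instead of testing for at most one; this assumes at most one match per line, so that match counts still approximate matching-line counts:

import re

count_matches = re.compile(rb"DramBaseAddress\d+").findall

j = 0
with open("rndcfg.cfg", "rb") as file:
    rest = b""
    for chunk in iter(lambda: file.read(32768), b""):
        chunk, _, rest = (rest + chunk).rpartition(b"\n")
        j += len(count_matches(chunk))  # every match in the chunk, not just one
    j += len(count_matches(rest))  # trailing data after the last newline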

It's a bit heavyweight, but it comes in three times faster than the Perl, and 8x if you remove i += chunk.count(b"\n")!

+1








