Compiling regular expressions in a frequently called function

Let's say I have a function that searches for multiple patterns in a string using regular expressions:

import re
def get_patterns(string):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    re_digits = re.compile(r"(\d+)")
    re_alpha = re.compile(r"(?i)([A-Z]+)")
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

get_patterns("99 bottles of beer on the wall")
(['99'], ['bottles', 'of', 'beer', 'on', 'the', 'wall'])

Now suppose this function will be called hundreds of thousands of times, and that the real case is not as trivial as this example. a) Does it matter that the regexes are compiled inside the function, i.e. is there a cost to calling re.compile on every call, or is the compiled pattern reused from a cache? b) If there is a cost, what is the recommended way to avoid that overhead?

One method is to pass a list of compiled regex objects to the function:

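re_digits = re.compile(r"(\d+)")
re_alpha = re.compile(r"(?i)([A-Z]+)")
def get_patterns(string, regexes):
    return [regex.findall(string) for regex in regexes]

get_patterns("99 bottles of beer on the wall", [re_digits, re_alpha])

but I don't like how this approach separates the regexes from the function that depends on them.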

UPDATE: As Jens recommended, I ran a quick timing check. Compiling the regexes in the default arguments is indeed a little (~30%) faster:

def get_patterns_defaults(string,
                          re_digits=re.compile(r"(\d+)"),
                          re_alpha=re.compile(r"(?i)([A-Z]+)")
                          ):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

from timeit import Timer
test_string = "99 bottles of beer on the wall"
t = Timer(lambda: get_patterns(test_string))
t2 = Timer(lambda: get_patterns_defaults(test_string))
print(t.timeit(number=100000))  # compiled in function body
print(t2.timeit(number=100000))  # compiled in default arguments
0.629958152771
0.474529981613


3 answers


One solution is to use default arguments. Default values are evaluated once, when the function is defined, so the regexes are compiled only once:

import re
def get_patterns(string, re_digits=re.compile(r"(\d+)"), re_alpha=re.compile(r"(?i)([A-Z]+)")):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

Now you can call it:

get_patterns(string)
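
A minimal sketch of why this works: default values are evaluated when the def statement runs and are stored on the function object, so every call reuses the same compiled patterns. The variable names below (defaults_before, same_objects, and so on) are only for illustration.

```python
import re

def get_patterns(string,
                 re_digits=re.compile(r"(\d+)"),
                 re_alpha=re.compile(r"(?i)([A-Z]+)")):
    """Return runs of digits and runs of letters found in string."""
    return re_digits.findall(string), re_alpha.findall(string)

# The compiled patterns are stored on the function object itself.
defaults_before = get_patterns.__defaults__
result = get_patterns("99 bottles of beer on the wall")
defaults_after = get_patterns.__defaults__

# The very same pattern objects are present before and after the call,
# so nothing was compiled again while the function ran.
same_objects = all(a is b for a, b in zip(defaults_before, defaults_after))
print(result)
print(same_objects)
```

One design caveat: a caller can override re_digits or re_alpha by passing extra arguments, which may or may not be what you want.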

      



You can use Python's timeit module to measure the runtime.
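
A minimal sketch of such a timing comparison. The helper names (find_digits_inline, find_digits_precompiled) are made up for this example. Note that the re module caches recently compiled patterns, so calling re.compile inside the function is normally a cache lookup plus call overhead, not a full recompilation. The actual numbers depend on your machine and Python version.

```python
import re
import timeit

TEST_STRING = "99 bottles of beer on the wall"
_RE_DIGITS = re.compile(r"(\d+)")  # compiled once, at import time

def find_digits_inline(string):
    # re.compile is called on every invocation; the re module's
    # internal cache usually turns this into a lookup.
    return re.compile(r"(\d+)").findall(string)

def find_digits_precompiled(string):
    # Reuses the module-level pattern object directly.
    return _RE_DIGITS.findall(string)

inline_seconds = timeit.timeit(
    lambda: find_digits_inline(TEST_STRING), number=100000)
precompiled_seconds = timeit.timeit(
    lambda: find_digits_precompiled(TEST_STRING), number=100000)

print("compile in body: %.3f s" % inline_seconds)
print("precompiled:     %.3f s" % precompiled_seconds)
```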

If you want to avoid recompiling these regexps, try compiling them once as module-level globals:

import re

_re_digits = re.compile(r"(\d+)")
_re_alpha = re.compile(r"(?i)([A-Z]+)")

def get_patterns(string): 
    digits = _re_digits.findall(string)
    alpha = _re_alpha.findall(string)
    return (digits, alpha)


Fun fact: you can set and get attributes on functions in Python, just like on any other object. So another solution that avoids global bindings and compiles the regexes only once would be something like this:

import re

def get_patterns(string):
    f = get_patterns
    return f.digits.findall(string), f.alpha.findall(string)

get_patterns.digits = re.compile(r"(\d+)")
get_patterns.alpha = re.compile(r"(?i)([A-Z]+)")

      

Another solution would be to use a closure:

def make_finder(*regexps):
    return lambda s: tuple(r.findall(s) for r in regexps)

get_patterns = make_finder(re.compile(r"(\d+)"), re.compile(r"(?i)([A-Z]+)"))
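
A hedged variant of the closure idea, written with a named inner function and with the patterns passed as strings. Having make_finder do the compiling is an assumption of this sketch, not part of the answer above. The point is the same: the patterns are compiled once, when the finder is created.

```python
import re

def make_finder(*patterns):
    # Compile once, when the finder is built.
    compiled = [re.compile(p) for p in patterns]

    def finder(string):
        # Each call reuses the compiled patterns captured by the closure.
        return tuple(r.findall(string) for r in compiled)

    return finder

get_patterns = make_finder(r"(\d+)", r"(?i)([A-Z]+)")
digits, alpha = get_patterns("99 bottles of beer on the wall")
print(digits, alpha)
```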

      
