Compiling regular expressions in a frequently called function
Let's say I have a function that searches for multiple patterns in a string using regular expressions:
import re

def get_patterns(string):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.
    """
    re_digits = re.compile(r"(\d+)")
    re_alpha = re.compile(r"(?i)([A-Z]+)")
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

get_patterns("99 bottles of beer on the wall")
(['99'], ['bottles', 'of', 'beer', 'on', 'the', 'wall'])
Now suppose this function will be called hundreds of thousands of times, and that the real patterns are less trivial than in this example. (a) Does it matter that the regexes are compiled inside the function, i.e. is there a cost to calling re.compile every time the function is called, or is the compiled pattern reused from a cache? (b) If there is a cost, what is the recommended way to avoid this overhead?
One method is to pass a list of compiled regex objects to the function:
re_digits = re.compile(r"(\d+)")
re_alpha = re.compile(r"(?i)([A-Z]+)")

def get_patterns(string, regexes):  # called with [re_digits, re_alpha]
    ...

but I don't like how this approach decouples the regexes from the function that depends on them.
UPDATE: As Jens recommended, I ran a quick timing check. Compiling the patterns in the default arguments is indeed a little (~30%) faster:
def get_patterns_defaults(string,
                          re_digits=re.compile(r"(\d+)"),
                          re_alpha=re.compile(r"(?i)([A-Z]+)")
                          ):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.
    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

from timeit import Timer

test_string = "99 bottles of beer on the wall"
t = Timer(lambda: get_patterns(test_string))
t2 = Timer(lambda: get_patterns_defaults(test_string))
print(t.timeit(number=100000))   # compiled in function body
print(t2.timeit(number=100000))  # compiled in args

0.629958152771
0.474529981613
One solution is to use default arguments. Default values are evaluated when the function is defined, so the patterns are compiled only once:
import re

def get_patterns(string,
                 re_digits=re.compile(r"(\d+)"),
                 re_alpha=re.compile(r"(?i)([A-Z]+)")):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.
    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

Now you can call it:
get_patterns(string)
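
To make the "compile once" point concrete, here is a minimal sketch that inspects where the defaults live. It assumes nothing beyond the standard library; the names `defaults`, `all_compiled`, and `only_caps` are made up for this illustration.

```python
import re

def get_patterns(string,
                 re_digits=re.compile(r"(\d+)"),
                 re_alpha=re.compile(r"(?i)([A-Z]+)")):
    return re_digits.findall(string), re_alpha.findall(string)

# Default values are evaluated once, at definition time, and stored
# on the function object itself.
defaults = get_patterns.__defaults__
pattern_type = type(re.compile(""))
all_compiled = all(isinstance(d, pattern_type) for d in defaults)

# A caller can still swap in a different pattern for a single call,
# here a case-sensitive one that only matches capital letters.
only_caps = get_patterns("99 Bottles of Beer", re_alpha=re.compile(r"([A-Z]+)"))
```

One side effect of this design is that the patterns become part of the function's signature, which is handy for testing but also means a caller can pass something unexpected.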
You can use Python's timeit module to measure the runtime.
If you want to avoid recompiling these regexps, try initializing them as module-level globals:
import re

_re_digits = re.compile(r"(\d+)")
_re_alpha = re.compile(r"(?i)([A-Z]+)")

def get_patterns(string):
    digits = _re_digits.findall(string)
    alpha = _re_alpha.findall(string)
    return (digits, alpha)
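
As a rough check on how much the module-level version saves, the sketch below compares it with compiling inside the function body. It relies on a CPython implementation detail: `re.compile` keeps an internal cache of recently compiled patterns, so the in-function version mostly pays for a function call and a cache lookup rather than a full recompile. The names `get_patterns_inline` and `get_patterns_global` are made up for this comparison, and the timings will vary with machine and Python version.

```python
import re
from timeit import timeit

_re_digits = re.compile(r"(\d+)")
_re_alpha = re.compile(r"(?i)([A-Z]+)")

def get_patterns_inline(string):
    # Calls re.compile on every invocation (served from re's cache in CPython).
    re_digits = re.compile(r"(\d+)")
    re_alpha = re.compile(r"(?i)([A-Z]+)")
    return re_digits.findall(string), re_alpha.findall(string)

def get_patterns_global(string):
    # Reuses the pattern objects compiled once at import time.
    return _re_digits.findall(string), _re_alpha.findall(string)

test_string = "99 bottles of beer on the wall"

# In CPython, compiling the same pattern string twice returns the cached object.
same_object = re.compile(r"(\d+)") is re.compile(r"(\d+)")

t_inline = timeit(lambda: get_patterns_inline(test_string), number=10000)
t_global = timeit(lambda: get_patterns_global(test_string), number=10000)
print("inline: %.4f s, global: %.4f s" % (t_inline, t_global))
```

If the two numbers come out close, the remaining difference is probably the cache lookup itself, which can still add up over hundreds of thousands of calls.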
Fun fact: you can set and get attributes on functions in Python, just like on any other object. So another solution that avoids global bindings and compiles each regex only once would be something like this:
import re

def get_patterns(string):
    f = get_patterns
    return f.digits.findall(string), f.alpha.findall(string)

get_patterns.digits = re.compile(r"(\d+)")
get_patterns.alpha = re.compile(r"(?i)([A-Z]+)")

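
To see where the compiled patterns end up in the function-attribute variant, here is a small self-contained sketch. The variables `result` and `attrs` are made up for the demonstration. Note one caveat: the body looks up `get_patterns` by name each time it runs, so rebinding that name elsewhere in the module would break it.

```python
import re

def get_patterns(string):
    f = get_patterns  # resolved by name in the module globals at call time
    return f.digits.findall(string), f.alpha.findall(string)

# The attributes must be set before the first call.
get_patterns.digits = re.compile(r"(\d+)")
get_patterns.alpha = re.compile(r"(?i)([A-Z]+)")

result = get_patterns("99 bottles of beer on the wall")

# Function attributes are stored in the function's own __dict__.
attrs = sorted(vars(get_patterns))
```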
Another solution would be to use a closure:
def make_finder(*regexps):
    return lambda s: tuple(r.findall(s) for r in regexps)

get_patterns = make_finder(re.compile(r"(\d+)"), re.compile(r"(?i)([A-Z]+)"))
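
For completeness, a short usage sketch of the closure version. The second finder, `get_vowels`, is a hypothetical extra that only shows the factory works with any number of patterns.

```python
import re

def make_finder(*regexps):
    # The returned function closes over the already compiled patterns.
    return lambda s: tuple(r.findall(s) for r in regexps)

get_patterns = make_finder(re.compile(r"(\d+)"), re.compile(r"(?i)([A-Z]+)"))
digits, alpha = get_patterns("99 bottles of beer on the wall")

# The same factory works with a single pattern; the result is a 1-tuple.
get_vowels = make_finder(re.compile(r"[aeiou]"))
(vowels,) = get_vowels("beer")
```

Unlike the original function, this version returns one list per pattern passed to `make_finder`, so the caller needs to know the order in which the patterns were given.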