Why python re.search method hangs?
I am using the python regex library to parse some strings and I currently find my regex is too complex or the string I am looking for is too long.
Here's an example of a hang:
>>> import re
>>> reg = "(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...
I'm not sure what's going on. Any help is appreciated!
EDIT: Here's a link with examples of what I'm trying to match: Regxr
source to share
The reason the code gets executed is catastrophic backtracking due to one required and 1+ optional patterns (those that might match an empty string) within a quantified group (\w+'?\s*)+
, which allows the regex engine to test many suitable paths, so many of them require too long to complete.
I suggest expanding the problem group in such a way that '
either \s
become mandatory and wrap them in an optional group:
(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***
See regex demo
Here (\w+(?:['\s]+\w+)*)
will match 1+ word of words followed by 0+ 1+ sequences '
or spaces followed by 1+ word characters. Thus, the pattern becomes linear and the regex engine does not match faster if an inconsistent string occurs.
The rest of the template:
-
\s*[-~]\s*
- either-
or~
wrapped with spaces 0+ -
(\$?\d+(?:\.\d+)?\$?)
- capture of group 2-
\$?
- 1 or 0$
characters -
\d+
- 1 + numbers -
(?:\.\d+)?
- 1 or 0 zero sequences:-
\.
- point -
\d+
- 1 + numbers
-
-
\$?
- 1 or 0$
characters
-
source to share