Why python re.search method hangs?

I am using the python regex library to parse some strings and I currently find my regex is too complex or the string I am looking for is too long.

Here's an example of a hang:

>>> import re
>>> reg = "(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...

      

I'm not sure what's going on. Any help is appreciated!

EDIT: Here's a link with examples of what I'm trying to match: Regxr

+3


source to share


1 answer


The reason the code gets executed is catastrophic backtracking due to one required and 1+ optional patterns (those that might match an empty string) within a quantified group (\w+'?\s*)+

, which allows the regex engine to test many suitable paths, so many of them require too long to complete.

I suggest expanding the problem group in such a way that '

either \s

become mandatory and wrap them in an optional group:

(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***

      

See regex demo



Here (\w+(?:['\s]+\w+)*)

will match 1+ word of words followed by 0+ 1+ sequences '

or spaces followed by 1+ word characters. Thus, the pattern becomes linear and the regex engine does not match faster if an inconsistent string occurs.

The rest of the template:

  • \s*[-~]\s*

    - either -

    or ~

    wrapped with spaces 0+
  • (\$?\d+(?:\.\d+)?\$?)

    - capture of group 2
    • \$?

      - 1 or 0 $

      characters
    • \d+

      - 1 + numbers
    • (?:\.\d+)?

      - 1 or 0 zero sequences:
      • \.

        - point
      • \d+

        - 1 + numbers
    • \$?

      - 1 or 0 $

      characters
+3


source







All Articles