Regression Pattern Catastrophic Countdown
I have the below regex used on one of my old Java systems that has been causing return problems lately. Quite often, backflows cause the machine's CPU to hit an upper limit, and it doesn't bounce back until the application is restarted.
Can anyone suggest a better way to rewrite this pattern, or a tool to help me do this?
Pattern:
^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$
Values:
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]
Catastrophic backtracking values โโ- if there is something wrong in the values โโ(additional binding added at the end):
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]
source to share
Never use *
when +
is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm sure you don't want to be treated []
as a valid input.
One of the biggest reasons for a fugitive is to have two or more alternatives that may be the same. This is what you have with the part |[\p{N}]*
. The regex engine has to try every conceivable path through the string before it gives up, so all those constructs \p{N}*
end up in an endless tug-of-war over every group of numbers.
But there is no point in trying to fix these problems because the general structure is wrong. I think this is what you are looking for:
^\[\p{N}+\](?:,\[\p{N}+\])*$
After it consumes the first token ( [1234567]
), if the next thing on the line is not a comma or the end of a line, it will fire immediately. If it sees a comma, it must continue to match another full token ( [89023432]
), or it will fire immediately.
This is probably the most important thing when you create a regex: if it fails, you want it to run as quickly as possible. You can use functions like atomic groups and possessive quantifiers, but if you get the regex structure right, you rarely need them. Rollback is not inevitable.
source to share