Regression Pattern Catastrophic Countdown

Question

Regression Pattern Catastrophic Countdown

I have the below regex used on one of my old Java systems that has been causing return problems lately. Quite often, backflows cause the machine's CPU to hit an upper limit, and it doesn't bounce back until the application is restarted.

Can anyone suggest a better way to rewrite this pattern, or a tool to help me do this?

Pattern:

^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$

Values:

[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]

Catastrophic backtracking values - if there is something wrong in the values (additional binding added at the end):

[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]

+3

java regex backtracking

Achilles Apr 20 15 at 14:50

source to share

3 answers

Your backtracking loop is caused by a double ]

at the end of the line.
I cleared the line before running the regex, replacing all occurrences [[

with [

and all ]]

with ]

.

+1

Binary Worrier Apr 20 15 at 14:55

source to share

You can use atomic grouping or possessive quantifiers to avoid this catastrophic fallback. Here's an example with atomic grouping that only takes 60 steps before your failed computation:

^(?>\[((\p{N}*\]\,\[\p{N}*)*|\p{N}*)\])$

See the demo here .

0

Wiktor Stribiżew Apr 20 15 at 14:59

source to share

Alan moore · Accepted Answer · 2015-04-20T16:07:32+0000

Never use *

when +

is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm sure you don't want to be treated []

as a valid input.

One of the biggest reasons for a fugitive is to have two or more alternatives that may be the same. This is what you have with the part |[\p{N}]*

. The regex engine has to try every conceivable path through the string before it gives up, so all those constructs \p{N}*

end up in an endless tug-of-war over every group of numbers.

But there is no point in trying to fix these problems because the general structure is wrong. I think this is what you are looking for:

^\[\p{N}+\](?:,\[\p{N}+\])*$

After it consumes the first token ( [1234567]

), if the next thing on the line is not a comma or the end of a line, it will fire immediately. If it sees a comma, it must continue to match another full token ( [89023432]

), or it will fire immediately.

This is probably the most important thing when you create a regex: if it fails, you want it to run as quickly as possible. You can use functions like atomic groups and possessive quantifiers, but if you get the regex structure right, you rarely need them. Rollback is not inevitable.

Regression Pattern Catastrophic Countdown

More articles: