Inconsistent behavior of `scan` and `match` for different Ruby versions

Background

This question concerns the behavior of the methods String#scan and String#match in Ruby. I am using a recursive regex that must match a balanced pair of parentheses. You can see this regex, /(\((?:[^\(\)]*\g<0>*)*\))/, in action at https://regex101.com/r/Q1lOC8/1 . There it displays the expected behavior: it matches top-level sets of parentheses that contain balanced sets of nested parentheses. Sample code to illustrate the problem looks like this:

➜  cat test.rb
s = "1+(x*(3-4)+5)-1"
r = /(\((?:[^\(\)]*\g<0>*)*\))/
puts s.match(r).inspect
puts s.scan(r).inspect

      

Problem

I get different results when running the above code sample in ruby-2.3.3 and ruby-2.4.1:

➜  docker run --rm -v "$PWD":/usr/src/app -w /usr/src/app ruby:2.3.3-alpine ruby test.rb
#<MatchData "(x*(3-4)+5)" 1:")">
[[")"]]
➜  docker run --rm -v "$PWD":/usr/src/app -w /usr/src/app ruby:2.4.1-alpine ruby test.rb
#<MatchData "(x*(3-4)+5)" 1:"(x*(3-4)+5)">
[["(x*(3-4)+5)"]]

      

The Ruby 2.4.1 result is what I expected. In both versions, match correctly matches the same outer set of parentheses, (x*(3-4)+5), but in ruby-2.3.3 the first capture group is, for some reason, just ")". If I change the regex to /(\(.*\))/, the behavior is the same in both versions (same as in 2.4.1 above), but it no longer balances the nested parentheses.

What is the true expected behavior of match in this case?

1 answer


First, I must point out that a regex working at regex101.com does not mean it works everywhere: any regex written with an online regex tester should be validated in the target environment. You tested with the PCRE option and it worked there, but PCRE is a different library from Onigmo, the regex engine used in Ruby.

Now, the problem is how the Onigmo regex engine handles recursion in 2.3.3: the construct \g<0> recurses the whole pattern (the 0th group), so the outer capturing parentheses (group 1) are repeated too (while the group ID stays the same), effectively creating a repeated capturing group. The values in such groups are overwritten at each iteration, which is why you get ) at the end.
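
To illustrate the overwriting described above, here is a minimal sketch that uses an ordinary quantified capture group rather than recursion. The string "abc" and the variable names are made up for the example; it only shows the general rule that a repeated group keeps its last value, not the exact internals of Onigmo's recursion handling.

```ruby
# A quantified capture group keeps only the value from its final iteration.
m = "abc".match(/(\w)+/)

whole = m[0]  # the overall match spans every iteration of the + loop
last  = m[1]  # group 1 was overwritten on each pass, so only the last value remains

puts whole.inspect  # "abc"
puts last.inspect   # "c"
```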

The workaround is to recurse the 1st group's subpattern rather than the whole pattern, so that the value of the 1st group is preserved instead of being rewritten at each iteration (since a capturing group is defined in the pattern, String#scan returns only the captures).



Use

r = /(\((?:[^\(\)]*\g<1>*)*\))/
                      ^
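
As a rough sketch, this is the question's test.rb with the \g<1> pattern swapped in. The second input string and the names m, captures and multi are made up for illustration. On Ruby 2.4 and later this should give the same result the question shows for 2.4.1; the explanation above implies the same for 2.3.3, but that is not verified here.

```ruby
s = "1+(x*(3-4)+5)-1"

# \g<1> recurses into group 1 only, instead of the whole pattern (\g<0>).
r = /(\((?:[^\(\)]*\g<1>*)*\))/

m = s.match(r)
captures = s.scan(r)

puts m.inspect
puts captures.inspect

# Because the pattern defines a capture group, scan returns one array per
# match; flatten gives a plain list of the top-level balanced groups.
multi = "(a)+(b*(c-d))".scan(r).flatten
puts multi.inspect
```

If the capture group is not needed elsewhere, calling flatten on the scan result, as in the last lines, is one simple way to get plain strings.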

      
