Inconsistent behavior of `scan` and `match` for different Ruby versions
Background
This question concerns the behavior of String#scan and String#match in Ruby. I am using a recursive regex that must match a balanced pair of parentheses. You can see this regex, /(\((?:[^\(\)]*\g<0>*)*\))/, in action at https://regex101.com/r/Q1lOC8/1 . There it shows the expected behavior: it matches top-level sets of parentheses that contain balanced sets of nested parentheses. Sample code that illustrates the problem:
➜ cat test.rb
s = "1+(x*(3-4)+5)-1"
r = /(\((?:[^\(\)]*\g<0>*)*\))/
puts s.match(r).inspect
puts s.scan(r).inspect
Problem
I get different results when running the above code sample in ruby-2.3.3 and ruby-2.4.1:
➜ docker run --rm -v "$PWD":/usr/src/app -w /usr/src/app ruby:2.3.3-alpine ruby test.rb
#<MatchData "(x*(3-4)+5)" 1:")">
[[")"]]
➜ docker run --rm -v "$PWD":/usr/src/app -w /usr/src/app ruby:2.4.1-alpine ruby test.rb
#<MatchData "(x*(3-4)+5)" 1:"(x*(3-4)+5)">
[["(x*(3-4)+5)"]]
The Ruby 2.4.1 result is what I expected. In both versions, match matches the same outer set of parentheses correctly, (x*(3-4)+5), but in ruby-2.3.3 the first capture group is just ")" for some reason. If I change the regex to /(\(.*\))/, the behavior is the same in both versions (same as in 2.4.1 above), but it no longer balances the nested parentheses.
What is the true expected behavior of match in this case?
First, I must point out that a regex working at regex101.com does not mean it will work everywhere: any regex written with an online regex tester should be validated against the target environment. You tested with the PCRE option, and it worked there because PCRE is a different library from Onigmo, the regex engine used in Ruby.
Now, the problem is how the Onigmo regex engine handles recursion in 2.3.3: the \g<0> construct recurses the whole pattern (group 0), so the outer capturing parentheses (group 1) are repeated too (while the group ID stays the same), effectively creating a repeated capturing group. The value of such a group is overwritten at each iteration, which is why you end up with ) at the end.
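As a minimal sketch of that overwriting, independent of recursion: a capturing group under a quantifier keeps only the text from its last iteration. The sample string and pattern below are illustrative assumptions, not taken from the question.

```ruby
# A capturing group under a quantifier is re-captured on every iteration,
# so only the last iteration's text survives in the group.
m = "abc".match(/(\w)+/)
whole = m[0]   # the entire match
last  = m[1]   # group 1 after the final iteration

puts whole  # abc
puts last   # c

# Because the pattern contains a capturing group, String#scan reports
# the group's value rather than the whole match.
scanned = "abc".scan(/(\w)+/)
puts scanned.inspect  # [["c"]]
```

The same effect, applied to group 1 being re-entered through \g<0>, would explain a result like [[")"]].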
The workaround is to recurse the group 1 subpattern instead, so that the value of group 1 is preserved in full rather than overwritten at each iteration (since a capturing group is defined in the pattern, String#scan returns only the captures).
Use \g<1> in place of \g<0>:
r = /(\((?:[^\(\)]*\g<1>*)*\))/
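A short sketch of the \g<1> variant applied to the string from the question, plus a second input with two top-level groups. The second string and the variable names are assumptions added for illustration. It is worth running this on each Ruby version you target rather than relying on the comments.

```ruby
# Recursing group 1 (\g<1>) instead of the whole pattern (\g<0>).
r = /(\((?:[^\(\)]*\g<1>*)*\))/

s = "1+(x*(3-4)+5)-1"
m = s.match(r)
puts m[1]            # (x*(3-4)+5)
puts s.scan(r).inspect  # [["(x*(3-4)+5)"]]

# Hypothetical second input: two separate top-level groups.
t = "(a(b))+(c)"
groups = t.scan(r).flatten
puts groups.inspect  # ["(a(b))", "(c)"]
```

String#scan returns one array per match because the pattern has a capturing group, so flatten is used here to get a plain list of the balanced top-level groups.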