PCRE regex backreference works, but subroutines do not
I am trying to match the texts:
1. "HeyHey HeyHey"
2. "HeyHey HeyHeyy"
with regular expressions:
a /(\w+) \1\w/
b /(\w+) (\w+)\w/
c /(\w+) (?1)\w/
- Regex a matches 1 and 2 exactly, except for the last 'y'.
- Regex b matches 1 and 2 exactly.
- Regex c doesn't match 1 or 2.
Following http://www.rexegg.com/regex-disambiguation.html#subroutines I thought b and c were equivalent, but apparently this is not the case.
What is the difference? Why doesn't the subroutine work when copying the same subpattern by hand does?
I experimented here: https://regex101.com/#pcre
This is because in PCRE a subpattern reference ( (?1) here) is atomic by default.
(Note that this behavior is specific to PCRE; Perl does not share it.)
The subpattern \w+ (with a greedy quantifier) matches all the word characters ( HeyHeyy in the second string), but since (?1) is atomic, the regex engine cannot backtrack and give back the last y to let the final \w succeed.
You can get the same result with this pattern:
/(\w+) (?>\w+)\w/
#      ^-----^-- atomic group
which does not match the string, whereas without the atomic group the pattern succeeds:
/(\w+) \w+\w/
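To see the effect outside PCRE, here is a minimal Python sketch. Python's re module has no (?1) subroutine syntax, so this is only an illustration of atomic behavior, not a reproduction of the PCRE case: it emulates the atomic group with the lookahead-plus-backreference idiom (?=(\w+))\2, which relies on the engine not backtracking into a completed lookahead. The names texts, plain and atomic_like are made up for the example.

```python
import re

texts = ["HeyHey HeyHey", "HeyHey HeyHeyy"]

# Plain greedy group: \w+ can give back its last character to the final \w.
plain = re.compile(r"(\w+) (\w+)\w")

# Emulated atomic group (assumption: stands in for PCRE's (?>\w+) / atomic (?1)).
# The lookahead captures the whole greedy run into group 2; once it has
# succeeded the engine does not retry it with a shorter capture, so \2
# consumes the entire word and nothing is left for the final \w.
atomic_like = re.compile(r"(\w+) (?=(\w+))\2\w")

for text in texts:
    print(repr(text),
          "plain:", bool(plain.search(text)),
          "atomic-like:", bool(atomic_like.search(text)))
```

The plain pattern finds a match in both strings and the atomic-like pattern in neither, which mirrors the difference between regex b and regex c in the question.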
More on atomic groups: http://regular-expressions.info/atomic.html
This feature is also described here (but only in a recursive context): http://www.rexegg.com/regex-recursion.html (see "Recursion depths are atomic")