Positive lookbehind vs non-capturing group: different behavior
I use Python regex (the re module) in my code and noticed different behavior in these cases:
re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']
and
re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']
I only need to get ['xyz', 'abc']. Why do the examples behave differently, and how can I get the desired result?
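The reason a and b are included in the second case is that the lookbehind (?<=[a-z]\)) does not consume any characters, and because you made it optional with ?, the engine is free to skip it. At the start of the string there is nothing behind the current position for the lookbehind to find, so it is skipped and [^.)]+ matches a. You are now at ), which [^.)]+ cannot match, so the search resumes after it. There \s* takes the space, the optional lookbehind is skipped again, and [^.)]+ matches xyz. The same is repeated with b) abc.
Remove the ? from the second case and you get the expected result, i.e. ['xyz', 'abc'].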
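To make the walk-through above easier to follow, here is a minimal sketch that prints where each match of the question's second pattern starts and ends. It uses only re.finditer on the question's own pattern and input; the variable names are just for illustration.

```python
import re

text = 'a) xyz. b) abc.'

# The question's second pattern: an optional (and therefore skippable) lookbehind.
pattern = re.compile(r'\s*(?<=[a-z]\))?[^.)]+')

# Collect (span, matched text) pairs to see where each match begins and ends.
spans = [(m.span(), m.group()) for m in pattern.finditer(text)]

for span, matched in spans:
    print(span, repr(matched))
# (0, 1) 'a'
# (2, 6) ' xyz'
# (7, 9) ' b'
# (10, 14) ' abc'
```

Each match stops right before a ) or a ., and the next one starts right after it, which is why a and b end up as separate results.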
The regex you're looking for is:
re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')
I believe Anirudha's currently accepted answer explains the difference between your use of a positive lookbehind and a non-capturing group well. However, the suggestion to remove the ? after the positive lookbehind actually results in [' xyz', ' abc'] (note the included spaces).
This is because the positive lookbehind does not match the space character, and the main character class does not exclude it either.
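As a quick check, the sketch below runs the variants discussed in this thread side by side. The third pattern, with a capturing group, is not from either answer; it is only a suggested alternative, useful because Python's re requires lookbehinds to be fixed-width, so a lookbehind cannot absorb a variable amount of whitespace.

```python
import re

text = 'a) xyz. b) abc.'

# Lookbehind without "?": the space after ")" is still part of the match.
with_spaces = re.findall(r'\s*(?<=[a-z]\))[^.)]+', text)

# Space moved into the lookbehind and excluded from the character class.
lookbehind = re.findall(r'(?<=[a-z]\) )[^) .]+', text)

# Alternative (an assumption, not from the answers): match the "a) " prefix
# normally and capture only the part that is wanted.
captured = re.findall(r'[a-z]\)\s*([^.)]+)', text)

print(with_spaces)  # [' xyz', ' abc']
print(lookbehind)   # ['xyz', 'abc']
print(captured)     # ['xyz', 'abc']
```

Note that [^) .]+ stops at the first space, so for an input such as 'a) foo bar.' the lookbehind version returns only ['foo'], while the capturing-group version returns ['foo bar'].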