Positive lookbehind vs non-capturing group: different behavior

I use Python's regex module (re) in my code and notice different behavior in these cases:

re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']

      

and

re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']

      

I only need to get ['xyz', 'abc']. Why do the examples behave differently, and how do I get the desired result?



2 answers


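The reason a and b are included in the second case is that the lookbehind (?<=[a-z]\)) doesn't consume any characters and you made it optional with ?, so the regex is always free to skip it. At the beginning of the string there is nothing behind the current position, so the lookbehind is skipped and [^.)]+ matches a, stopping at ).

No match can start at ), so the next attempt starts at the space that follows it. \s* consumes the space, the optional lookbehind is skipped again, and [^.)]+ matches xyz, which gives ' xyz'.

The same is repeated with b) abc.

Remove ? from the second case and you get the expected result, i.e. ['xyz', 'abc']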
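To make the effect of the optional lookbehind visible, here is a minimal sketch that lists every match with its position and compares the question's pattern with the same pattern after the lookbehind is deleted. The helper name show_matches is made up for this illustration and is not part of the re module.

```python
import re

text = 'a) xyz. b) abc.'

# Pattern from the question: the trailing "?" makes the lookbehind optional.
optional_lookbehind = r'\s*(?<=[a-z]\))?[^.)]+'
# The same pattern with the optional lookbehind deleted entirely.
no_lookbehind = r'\s*[^.)]+'


def show_matches(pattern, text):
    """Return (start, end, matched_text) for every match, left to right."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


with_optional = show_matches(optional_lookbehind, text)
without = show_matches(no_lookbehind, text)

for start, end, matched in with_optional:
    print(f'{start:>2}-{end:<2} {matched!r}')

# An optional zero-width assertion never rejects a match, so both
# patterns find the same matches at the same positions.
print(with_optional == without)
```

The first match starts at position 0 and contains only a, because [^.)]+ cannot cross the closing parenthesis.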



The regex you're looking for is:

re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')

      



I believe that Anirudha's currently accepted answer explains the difference between your use of a positive lookbehind and a non-capturing group; however, the suggestion to remove ? after the positive lookbehind actually results in [' xyz', ' abc'] (note the leading spaces).

This is because the positive lookbehind does not match the space character, and the main character class does not exclude it either.
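The variants discussed in this thread can be compared side by side with a short sketch. The third pattern, which uses a capturing group instead of a lookbehind, is not from either answer; it is shown only as one possible alternative, on the assumption that every item starts with a label such as a).

```python
import re

text = 'a) xyz. b) abc.'

# Lookbehind made mandatory, but the space after "a)" is not part of it,
# so the space is matched by [^.)]+ and stays in the result.
required = re.findall(r'\s*(?<=[a-z]\))[^.)]+', text)

# Space moved into the lookbehind and excluded from the character class.
with_space = re.findall(r'(?<=[a-z]\) )[^) .]+', text)

# Alternative (an assumption, not from the thread): consume the label and
# any whitespace, and let findall return only the captured group.
captured = re.findall(r'[a-z]\)\s*([^.)]+)', text)

print(required)    # [' xyz', ' abc']
print(with_space)  # ['xyz', 'abc']
print(captured)    # ['xyz', 'abc']
```

Python's re module requires a lookbehind to have a fixed width, which is why the second pattern puts a single literal space in the lookbehind rather than \s*. The capturing-group version does not have that restriction.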
