re.findall() is not as greedy as expected - Python 2.7

Question

I am trying to extract a list of complete sentences from a plain-text body using regex in Python 2.7. For my purposes, it is not important that everything that could be construed as a complete sentence ends up on the list, but everything on the list should be a complete sentence. Below is some code that illustrates the problem:

import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences

According to this regex tester, I should theoretically get a list like this:

>>> ["Hello World!", "This is your captain speaking."]

But the output I actually get is as follows:

>>> [' World', ' speaking']

The documentation indicates that the string is scanned from left to right and that the * and + operators are greedy. Any help would be appreciated.

python regex findall

Lee richards May 06 '17 at 21:27

2 answers

Raymond Hettinger · Answer 1 · 2017-05-06T21:35:55+0000

The problem is that findall() returns only the captured subgroups, not the complete match. From the docs for re.findall():

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
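
A small sketch of that rule, using a made-up sample string and patterns (they are not from the question): the shape of what findall() returns depends on how many capturing groups the pattern contains.

```python
import re

sample = "a1 b2 c3"  # hypothetical input, chosen only for illustration

# No groups: each item is the whole match.
no_groups = re.findall(r"[a-z]\d", sample)

# One group: each item is only what that group captured.
one_group = re.findall(r"[a-z](\d)", sample)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([a-z])(\d)", sample)

print(no_groups)   # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```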

It's easy to see what's going on by using re.finditer() and examining the match objects:

>>> import re
>>> text = "Hello World! This is your captain speaking."

>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)

>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)

The solution to your problem is to suppress the subgroup by making it non-capturing with ?: so that findall() falls back to returning the complete match. Then you get the expected results:

>>> re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
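
If the capturing group has to stay in the pattern for some other reason, a possible alternative (a sketch, not the only way to do it) is to iterate with re.finditer() and take group(0), which is always the complete match:

```python
import re

text = "Hello World! This is your captain speaking."

# Pattern from the question, with the capturing group left in place.
pattern = re.compile(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]")

# group(0) is the full match, regardless of any subgroups.
sentences = [mo.group(0) for mo in pattern.finditer(text)]

print(sentences)  # ['Hello World!', 'This is your captain speaking.']
```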

dawg · Answer 2 · 2017-05-08T01:00:35+0000

You can change your regex somewhat:

>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']
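
One trade-off worth checking with this shorter pattern: its final character class also accepts , ; and : so a match can stop at the first comma. A quick sketch with a made-up sentence (not from the question), compared against the non-capturing pattern from the first answer:

```python
import re

sample = "Well, hello there."  # hypothetical input for illustration

# Shorter pattern: ',' is accepted as a terminator, so the match ends early.
short = re.findall(r"[A-Z][\w\s]+[!.,;:]", sample)

# Non-capturing pattern: only '.', '!' or '?' may end a match, and a comma
# directly after the first word is not allowed, so nothing matches here.
strict = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", sample)

print(short)   # ['Well,']
print(strict)  # []
```

Since the question only requires that everything on the list is a complete sentence, returning nothing for such input may be preferable to returning a fragment.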
