re.findall() is not as greedy as expected - Python 2.7
I am trying to extract a list of complete sentences from a plaintext body using regex in Python 2.7. For my purposes, it does not matter if some complete sentences are missed, but everything on the list must be a complete sentence. Below is some code that illustrates the problem:
import re
text = "Hello World! This is your captain speaking."
sentences = re.findall(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences
According to this regex tester, I should theoretically get a list like this:
>>> ["Hello World!", "This is your captain speaking."]
But the output I actually get is as follows:
>>> [' World', ' speaking']
The documentation indicates that the string is scanned from left to right and that the * and + operators are greedy. Any help is appreciated.
The problem is that findall() returns only the captured subgroups, not the complete match. From the docs for re.findall():
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
It's easy to see what's going on using re.finditer() and examining the match objects:
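To make that rule concrete, here is a small sketch of the three shapes findall() can return. The patterns are simplified ones made up for illustration, not the pattern from the question:

```python
import re

text = "Hello World! This is your captain speaking."

# No groups in the pattern: each item is the full match.
no_groups = re.findall(r"[A-Z]\w+", text)

# One capturing group: each item is only that group's text.
one_group = re.findall(r"([A-Z])\w+", text)

# Two capturing groups: each item is a tuple of the groups.
two_groups = re.findall(r"([A-Z])(\w+)", text)

print(no_groups)   # ['Hello', 'World', 'This']
print(one_group)   # ['H', 'W', 'T']
print(two_groups)  # [('H', 'ello'), ('W', 'orld'), ('T', 'his')]
```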
>>> import re
>>> text = "Hello World! This is your captain speaking."
>>> it = re.finditer(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)
>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
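Note why the group holds only the last word: when a capturing group is repeated with *, it keeps just the text of its final repetition. A minimal sketch using the same pattern on the second sentence alone:

```python
import re

pattern = r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]"
mo = re.search(pattern, "This is your captain speaking.")

# group(0) is the whole match; the regex itself was greedy enough.
full = mo.group(0)

# group(1) matched ' is', ' your', ' captain' and ' speaking' in turn,
# but only the last repetition is kept.
last = mo.group(1)

print(full)  # This is your captain speaking.
print(last)  #  speaking
```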
The solution to your problem is to make the group non-capturing with ?: (that is, write (?:...) instead of (...)). Then you get the expected results:
>>> re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
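If you would rather keep the capturing group for some other reason, one alternative (a sketch, not the only way) is to use re.finditer() and collect group(0) from each match yourself:

```python
import re

text = "Hello World! This is your captain speaking."

# Same pattern as in the question, capturing group left in place.
pattern = r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]"

# group(0) is always the complete match, regardless of subgroups.
sentences = [mo.group(0) for mo in re.finditer(pattern, text)]

print(sentences)  # ['Hello World!', 'This is your captain speaking.']
```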