re.findall() is not as greedy as expected - Python 2.7

I am trying to extract a list of complete sentences from a plain-text body using regex in Python 2.7. For my purposes, the list does not need to include everything that could be construed as a complete sentence, but everything on the list must be a complete sentence. The code below illustrates the problem:

import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences

      

According to this regex tester, I should get a list like this:

>>> ["Hello World!", "This is your captain speaking."]

      

But the output I actually get is as follows:

>>> [' World', ' speaking']

      

The documentation indicates that searches are performed from left to right and that the * and + operators are greedy. I would appreciate any help.



2 answers


The problem is that re.findall() returns only the captured groups, not the complete match. From the docs for re.findall():

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

It's easy to see what's going on using re.finditer() and examining the match objects. group(0) is the complete match, while groups() holds only the last repetition of the capturing group:



>>> import re
>>> text = "Hello World! This is your captain speaking."

>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)

>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
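
As a minimal sketch of the rule quoted from the docs, the snippet below runs re.findall() with zero, one, and two capturing groups against the same text from the question. The simplified patterns and the variable names are made up for illustration; only the last pattern resembles the one in the question.

```python
import re

text = "Hello World! This is your captain speaking."

# No groups: findall returns the full text of each match.
no_groups = re.findall(r"[A-Z]\w+", text)

# One group: findall returns only what that group captured.
one_group = re.findall(r"([A-Z])\w+", text)

# Two groups: findall returns one tuple per match.
two_groups = re.findall(r"([A-Z])(\w+)", text)

# A repeated group keeps only its last repetition, which is
# why the question's output contains just the final word.
repeated = re.findall(r"[A-Z]\w+(\s+\w+)*[.!?]", text)

print(no_groups)
print(one_group)
print(two_groups)
print(repeated)
```

The match itself is still greedy in every case; only the value that findall reports changes with the number of groups.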

      

The solution to your problem is to make the group non-capturing with (?:...). Then you get the expected results:

>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
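
If the capturing group has to stay (for example because it is used elsewhere), another option is to collect group(0) from re.finditer() instead of calling re.findall(). This is only a sketch of that alternative, using the question's original pattern unchanged; the variable names are illustrative.

```python
import re

text = "Hello World! This is your captain speaking."

# The question's original pattern, capturing group left in place.
pattern = re.compile(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]")

# group(0) is always the complete match, regardless of groups.
sentences = [m.group(0) for m in pattern.finditer(text)]

print(sentences)
```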

      



You can change your regex somewhat:



>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']
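
The two patterns agree on the sample text but can differ on other input. The comparison below is a rough sketch on a made-up sample string (not from the question): the character-class pattern accepts a comma as a terminator and has no "?" in its final class, so it may return fragments and skip questions, while the non-capturing pattern skips a sentence whose first word is followed by a comma.

```python
import re

# Hypothetical sample text, chosen to show where the patterns differ.
sample = "First, we wait. Are you ready? Go now!"

# Pattern from the first answer (non-capturing group).
grouped = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", sample)

# Pattern from the second answer (character classes only).
char_class = re.findall(r"[A-Z][\w\s]+[!.,;:]", sample)

print(grouped)
print(char_class)
```

Since the question requires that everything on the list be a complete sentence, the terminator class is worth checking against real input before picking either pattern.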

      
