Why doesn't the regex capture the first word? (Python)

Why is my regex pattern not capturing the word before the preposition?

My regex pattern is trying to capture nouns that are followed by a preposition. For example:
• Academy of Management → Academy
• McGraw Hill Foundation of Books → Foundation

For the following text:

"The Academy of Enterprise Management and McGraw Hill presents an annual award for individuals who design and innovate in entrepreneurship pedagogy for graduates or students."

import re

pp = r'[A-Z][A-Za-z]+\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'

x2 = re.findall(pp, test)  # test holds the text quoted above

x2

outputs:

['of']

Why doesn't it return Academy?



4 answers


A capturing group is a section of a regular expression enclosed in parentheses ( ). Groups are used to extract specific parts of the matched text. It looks like you introduced one by accident, since you are using the parentheses only to match for, of, in, or by.

When your expression contains exactly one capture group (as in your question), re.findall returns a list of the matches for that group rather than the full matches. At the moment there is no group around the first part of your regex. If you want to capture it too, wrap it in parentheses as well:

pp=r'([A-Z][A-Za-z]+\s+\b(for|of|in|by))\b(?=\s+[A-Z][A-Za-z]+)'
#    ^                                 ^
re.findall(pp,test)

      

returns:

[('Academy of', 'of')]

      



Now re.findall returns a list of tuples because there are multiple capture groups. The elements of each tuple appear in the order in which the groups open.
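To make the difference concrete, here is a minimal sketch that runs both patterns against the sentence from the question. The variable names (lookahead, one_group, two_groups) are only illustrative.

```python
import re

test = ("The Academy of Enterprise Management and McGraw Hill presents an "
        "annual award for individuals who design and innovate in "
        "entrepreneurship pedagogy for graduates or students.")

# Shared tail of both patterns: the next word must start with a capital.
lookahead = r'(?=\s+[A-Z][A-Za-z]+)'

# One capture group: findall returns only the text of that group.
one_group = re.findall(r'[A-Z][A-Za-z]+\s+\b(for|of|in|by)\b' + lookahead, test)

# Two capture groups: findall returns one tuple per match,
# ordered by where each group opens.
two_groups = re.findall(r'([A-Z][A-Za-z]+\s+\b(for|of|in|by))\b' + lookahead, test)

print(one_group)   # ['of']
print(two_groups)  # [('Academy of', 'of')]
```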

If you don't want the preposition captured as a separate group, make it non-capturing:

(?:for|of|in|by)

      

Then the only thing returned is ['Academy of']. At that point only one capturing group is left and it wraps the whole match, so you can drop its parentheses too, and re.findall will return whatever matches the full regex:

pp=r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
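As a rough check, the sketch below runs that pattern on the question's sentence. It also shows one possible variant (my assumption about what the asker ultimately wants, based on the "Academy of Management → Academy" example) that captures only the noun.

```python
import re

test = ("The Academy of Enterprise Management and McGraw Hill presents an "
        "annual award for individuals who design and innovate in "
        "entrepreneurship pedagogy for graduates or students.")

# No capture groups at all: findall returns the full match.
whole = re.findall(
    r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)', test)

# Hypothetical variant: capture just the noun, keep the preposition
# non-capturing, so findall returns the noun alone.
noun_only = re.findall(
    r'([A-Z][A-Za-z]+)\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)', test)

print(whole)      # ['Academy of']
print(noun_only)  # ['Academy']
```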

      



Just add a capturing group around the word before the preposition:

pp = r'([A-Z][A-Za-z]+)\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'



Or, if you want to capture the whole "word + preposition" string:

pp = r'([A-Z][A-Za-z]+\s+\b(?:for|of|in|by))\b(?=\s+[A-Z][A-Za-z]+)'
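For illustration, this is roughly how the first (two-group) pattern behaves. The sample string is an assumption, stitched together from the two examples in the question, and the names pairs and nouns are illustrative.

```python
import re

# Assumed sample built from the question's two examples.
sample = "Academy of Management and McGraw Hill Foundation of Books"

pp = r'([A-Z][A-Za-z]+)\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'

# Two groups, so each match comes back as a (noun, preposition) tuple.
pairs = re.findall(pp, sample)
print(pairs)  # [('Academy', 'of'), ('Foundation', 'of')]

# Unpack the tuples if only the nouns are needed.
nouns = [noun for noun, _prep in pairs]
print(nouns)  # ['Academy', 'Foundation']
```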



The regex search itself works as you'd expect. What trips you up is that the parentheses around for|of|in|by form a capture group.

From the re.findall() docs:

If one or more groups are present in the pattern, return a list of groups.

Here's how you can fix it:

pp = r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
                           ^^

      

(?:...) denotes a non-capturing group. With it, re.findall() returns the entire match.
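If you would rather keep the capture group, one alternative (not part of the fix above, just a sketch) is re.finditer, which yields match objects. There, group(0) is always the full match, however many groups the pattern has.

```python
import re

test = ("The Academy of Enterprise Management and McGraw Hill presents an "
        "annual award for individuals who design and innovate in "
        "entrepreneurship pedagogy for graduates or students.")

# The original pattern from the question, capture group left in place.
pp = r'[A-Z][A-Za-z]+\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'

# group(0) is the whole match; group(1) is the captured preposition.
matches = [(m.group(0), m.group(1)) for m in re.finditer(pp, test)]
print(matches)  # [('Academy of', 'of')]
```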



From the documentation for re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

In your pattern, you have one capture group, (for|of|in|by), and one lookahead, (?=\s+[A-Z][A-Za-z]+). The lookahead does not capture: the (?= prefix marks it as an assertion that only checks what follows.
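A quick way to confirm this is the groups attribute of a compiled pattern, which counts capturing groups only. The sketch below uses illustrative names (original, fixed, sample) and a short assumed sample string.

```python
import re

original = re.compile(r'[A-Z][A-Za-z]+\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)')
fixed = re.compile(r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)')

# The lookahead is not counted; only (for|of|in|by) is a capturing group.
print(original.groups)  # 1
print(fixed.groups)     # 0

# The lookahead checks " Management" but does not consume it,
# so the match ends right after "of".
sample = "Academy of Management"
m = fixed.search(sample)
print(m.group(0))  # Academy of
print(m.end())     # 10
```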

If you want "Academy of" returned as a single string, just make the capturing group non-capturing:

pp = r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
                           ^
re.findall(pp, test)  # returns ['Academy of']

      
