Python regex to extract the first word, or the first two words if both are capitalized

The regex I am using only matches when the first two words of a string are both capitalized. I want it to return just the first word of the line when the second word is not capitalized.

Here are some examples:

s = 'Smith John went to ss for Jones.'
s = 'Jones, Greg went to 2b for Smith.'
s = 'Doe went to ss for Jones.'

      

Basically, I just want the regex to output the following:

'Smith John'
'Jones, Greg'
'Doe'

      

This is my current regex. It does not capture the Doe example:

new = re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', s)

      



2 answers


Regex is overkill here. str.isupper() works well enough:

In [11]: def getName(s):
    ...:     first, second = s.split()[:2]
    ...:     if first[0].isupper():
    ...:         if second[0].isupper():
    ...:             return ' '.join([first, second])
    ...:         return first
    ...:     

      

This gives:

In [12]: getName('Smith John went to ss for Jones.')
Out[12]: 'Smith John'

In [13]: getName('Jones, Greg went to 2b for Smith.')
Out[13]: 'Jones, Greg'

In [14]: getName('Doe went to ss for Jones.')
Out[14]: 'Doe'

      

Add some checks to keep from failing when your string only contains one word and you're good to go.
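The checks mentioned above could look something like the following. This is only a sketch: the name get_name and the choice to return None for an empty string or a lowercase first word are assumptions, not part of the original answer.

```python
def get_name(s):
    """Return the first word, plus the second word if it is also capitalized."""
    words = s.split()
    # Assumption: return None for an empty string or a lowercase first word.
    if not words or not words[0][0].isupper():
        return None
    # Only look at the second word when there actually is one.
    if len(words) > 1 and words[1][0].isupper():
        return ' '.join(words[:2])
    return words[0]
```

With this version, a one-word string such as 'Doe' returns 'Doe' instead of raising a ValueError on the tuple unpacking.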




If you want to use a regular expression, you can use a pattern like this:

In [36]: pattern = re.compile(r'([A-Z].*? ){1,2}')

In [37]: pattern.match('Smith John went to ss for Jones.').group(0).rstrip()
Out[37]: 'Smith John'

In [38]: pattern.match('Doe went to ss for Jones.').group(0).rstrip()
Out[38]: 'Doe'

      

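r'([A-Z].*? ){1,2}' matches the first word and, if it is also capitalized, the second. Each repetition has to end with a space, so a string that consists of a single word with no trailing space will not match.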
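If you would rather stay close to the pattern from the question, one option is to make the second word optional and allow a comma before the whitespace. This is a hedged sketch rather than a drop-in fix: NAME_RE and first_name_words are made-up names, and it assumes names consist only of word characters and hyphens, as in the question's [\w-] class.

```python
import re

# First capitalized word, then optionally: an optional comma, whitespace,
# and a second capitalized word.
NAME_RE = re.compile(r'[A-Z][\w-]*(?:,?\s+[A-Z][\w-]*)?')

def first_name_words(s):
    # re.match anchors at the start of the string, so only the leading words count.
    m = NAME_RE.match(s)
    return m.group(0) if m else None
```

Because the second word sits in an optional group instead of a repetition that needs a trailing space, this also handles a string that is just one word, and there is no trailing whitespace to strip.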



import re
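# {1,} keeps matching for as long as the words are capitalized; use {1,2} to stop at two words
print(re.match(r'([A-Z].*?(?:[, ]+)){1,}', s).group().rstrip())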

      


