Capturing all consecutive words with spaces with regex in python?

Question

I am trying to match all runs of consecutive all-caps words / phrases using regex in Python. Consider the following:

    text = "The following words are ALL CAPS. The following word is in CAPS."

The code should return:

    ALL CAPS, CAPS

I am currently using:

    matches = re.findall('[A-Z\s]+', text, re.DOTALL)

But this returns:

    ['T', ' ', ' ', ' ', ' ALL CAPS', ' T', ' ', ' ', ' ', ' ', ' CAPS']

I clearly don't need the whitespace-only matches or the lone "T". I want to return only runs of consecutive words, or single words, that consist entirely of uppercase letters.

Thanks.

+3

python regex

BHudson Apr 20 17 at 15:01

4 answers

Your regex relies on explicit conditions (space after letters).

matches = re.findall(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])",text)

This captures runs of A-Z, optionally split by one whitespace character, and requires the last character to be something other than a lowercase letter, a digit, or a non-word character.
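To see how this pattern behaves, here is a small sketch that runs it on the question's text and on two extra inputs. The extra inputs and the helper name find_caps are my own, added only for illustration. Because the pattern needs at least three letters and allows only one optional whitespace character, it appears to miss two-letter words and to split runs longer than two words.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])")

def find_caps(text):
    """Return every match of the answer's pattern (hypothetical helper)."""
    return PATTERN.findall(text)

text = "The following words are ALL CAPS. The following word is in CAPS."
print(find_caps(text))             # ['ALL CAPS', 'CAPS']

# Illustrative edge cases, not from the original post:
print(find_caps("this IS short"))  # [] because "IS" has fewer than three letters
print(find_caps("ONE TWO THREE"))  # ['ONE TWO', 'THREE'] because only one space is allowed
```

If short words or longer runs matter for your data, one of the other answers may be a better fit.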

+1

Dashadower Apr 20 17 at 15:07

While keeping the regex, you can use strip() and filter:

    import re

    string = "The following words are ALL CAPS. The following word is in CAPS."
    # strip() trims each match; filter(None, ...) drops the whitespace-only ones
    result = list(filter(None, [x.strip() for x in re.findall(r"\b[A-Z\s]+\b", string)]))
    # ['ALL CAPS', 'CAPS']

+1

Pedro lobito Apr 20 17 at 15:28

Assuming each match should start and end with an uppercase letter and contain only uppercase letters and whitespace:

\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b

The |[A-Z] alternative captures single-letter words such as "I" or "A".
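A minimal sketch of how this pattern could be used with re.findall. The function name find_caps_runs and the second test sentence are assumptions added for illustration. Since the pattern has a single capture group, findall returns the text of that group for each match.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b")

def find_caps_runs(text):
    """Return runs of uppercase words found by the pattern (hypothetical helper)."""
    return PATTERN.findall(text)

text = "The following words are ALL CAPS. The following word is in CAPS."
print(find_caps_runs(text))           # ['ALL CAPS', 'CAPS']

# Single-letter words are picked up by the |[A-Z] branch:
print(find_caps_runs("I saw A CAT"))  # ['I', 'A CAT']
```

Note that \s also matches newlines, so a run can span a line break; replace \s with a literal space if that is not wanted.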

0

Tezra Apr 20 17 at 15:08

Toto · Accepted Answer · 2017-04-20T15:20:02+0000

This does the job:

    import re

    text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
    # Each word may contain at most one lowercase letter; words are joined by whitespace
    matches = re.findall(r"(\b(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)", text)
    print(matches)

Output:

['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']

Explanation:

(           : start group 1
  \b        : word boundary
  (?:       : start non capture group
    [A-Z]+  : 1 or more capitals
    [a-z]?  : 0 or 1 small letter
    [A-Z]*  : 0 or more capitals
   |        : OR
    [A-Z]*  : 0 or more capitals
    [a-z]?  : 0 or 1 small letter
    [A-Z]+  : 1 or more capitals
  )         : end group
  \b        : word boundary
  (?:       : non capture group
    \s+     : 1 or more spaces
    (?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+) : same as above
    \b      : word boundary
  )*        : 0 or more times the non capture group
)           : end group 1
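The breakdown above can also be written as a commented pattern using re.VERBOSE, which some readers may find easier to maintain. This is only a sketch of the same regex: the names WORD and PATTERN are mine, and the outer capture group is dropped because findall returns the whole match when a pattern has no capturing groups.

```python
import re

# One "word": capitals with at most one lowercase letter somewhere in it.
WORD = r"(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)"

PATTERN = re.compile(
    rf"""
    \b {WORD} \b           # first word of the run
    (?: \s+ {WORD} \b )*   # zero or more further words, separated by whitespace
    """,
    re.VERBOSE,
)

text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
print(PATTERN.findall(text))      # ['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']

# The text from the question:
original = "The following words are ALL CAPS. The following word is in CAPS."
print(PATTERN.findall(original))  # ['ALL CAPS', 'CAPS']
```

On the question's own text the single-lowercase allowance never triggers, so the result is the list the asker wanted.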
