Find all substrings with at least one group
I am trying to find all substrings of a string that satisfy a condition.
Let's say we have a line:
s = 'some text 1a 2a 3 xx sometext 1b yyy some text 2b.'
I need to apply the search pattern {(one (group of words), two (other group of words), three (other group of words)), word}. Each of the first three positions is optional, but at least one of them must be present. If so, I also need the word that follows them. The output should be:
1a 2a 3 xx
1b yyy
2b
I wrote this expression:
import re
find_it = re.compile(r"((?P<one>\b1a\s|\b1b\s)|" +
                     r"(?P<two>\b2a\s|\b2b\s)|" +
                     r"(?P<three>\b3\s|\b3b\s))+" +
                     r"(?P<word>\w+)?")
In reality each group contains many different words (not just 1a, 1b), and I cannot merge them into one group. A group should be None if it is empty. Obviously, the result is wrong:
find_it.findall(s)
> 2a 1a 2a 3 xx
> 1b 1b yyy
I am grateful for your help!
You can use the following regex:
>>> reg = re.compile(r'((?:(?:[12][ab]|3b?)\s?)+(?:\w+|\.))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '2b.']
Here I have simply shortened your regex using character classes and the ? quantifier. The token part, [12][ab]|3b?, has two alternatives:
[12][ab] matches 1a, 1b, 2a and 2b, and 3b? matches 3b and 3.
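One way to check which tokens the alternation accepts is to test candidates with re.fullmatch, which requires the whole string to match. This is only a small sketch; the candidate list is made up for illustration and is not part of the question.

```python
import re

# The token alternation from the answer.
token = re.compile(r'[12][ab]|3b?')

# Hypothetical candidates: the first six should match, the rest should not.
candidates = ['1a', '1b', '2a', '2b', '3', '3b', '3a', '1c', '4a']

# fullmatch succeeds only if the pattern covers the entire candidate.
accepted = [c for c in candidates if token.fullmatch(c)]
print(accepted)
```

Only 1a, 1b, 2a, 2b, 3 and 3b are accepted; 3a is rejected because 3b? can consume only the 3.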
And if you don't want the dot after the final 2b, you can use the following regex with a positive lookahead, which is more general than the previous one (making \s optional inside the repeated group is not a good idea):
>>> reg = re.compile(r'((?:(?:[12][ab]|3b?)\s)+\w+|(?:(?:[12][ab]|3b?))+(?=\.|$))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '2b']
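The effect of the lookahead can be seen in isolation by comparing it with a variant that consumes the dot. The sketch below uses only the second alternative of the regex above; the names consuming and lookahead are chosen here for illustration.

```python
import re

s = 'some text 1a 2a 3 xx sometext 1b yyy some text 2b.'

# Consumes the trailing dot (or end of string), so the dot is part of the match.
consuming = re.compile(r'(?:[12][ab]|3b?)+(?:\.|$)')

# (?=...) only asserts that a dot or the end of the string follows;
# it matches at that position without consuming any characters.
lookahead = re.compile(r'(?:[12][ab]|3b?)+(?=\.|$)')

with_dot = consuming.findall(s)
without_dot = lookahead.findall(s)
print(with_dot, without_dot)
```

Both patterns find only the final token, but the first returns it as 2b. and the second as 2b.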
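Also, if the numbers and substrings in your example are just placeholders, you can use [0-9][a-z]? as a more general pattern (the output below comes from a test string that additionally contains 5h 9 7y examole):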
>>> reg = re.compile(r'((?:[0-9][a-z]?\s)+\w+|(?:[0-9][a-z]?)+(?=\.|$))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '5h 9 7y examole', '2b']
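The patterns above return only the whole substring. If the separate named groups from the question are still needed, with None for a group that is absent, one possible approach is to build a named alternative per word list and read m.groupdict() from re.finditer instead of using findall. This is a rough sketch, not a drop-in solution: GROUPS, build_pattern and find_groups are hypothetical names, the word lists are stand-ins for the real ones, and it assumes every word starts and ends with a word character so that \b behaves as intended.

```python
import re

# Hypothetical word lists standing in for the real groups from the question.
GROUPS = {
    'one': ['1a', '1b'],
    'two': ['2a', '2b'],
    'three': ['3', '3b'],
}

def build_pattern(groups):
    """Build one named alternative per group; the alternation repeats at least once."""
    alternatives = []
    for name, words in groups.items():
        # Longest words first, so that '3b' is tried before '3'.
        escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
        alternatives.append('(?P<%s>%s)' % (name, '|'.join(escaped)))
    return re.compile(
        r'(?:\b(?:' + '|'.join(alternatives) + r')\b\s*)+(?P<word>\w+)?'
    )

def find_groups(text, groups=GROUPS):
    """Return (matched substring, {group name: value or None}) pairs."""
    pattern = build_pattern(groups)
    # groupdict() maps a group that did not take part in the match to None.
    return [(m.group(0).rstrip(), m.groupdict()) for m in pattern.finditer(text)]

s = 'some text 1a 2a 3 xx sometext 1b yyy some text 2b.'
for substring, parts in find_groups(s):
    print(repr(substring), parts)
```

On the sample string this yields the substrings 1a 2a 3 xx, 1b yyy and 2b, and for 1b yyy the groups two and three are None. Note that a group that matches more than once inside the same substring keeps only its last value.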