Match in specific locations
This is a follow-up to this question (not asked by me). While trying to answer it, I ran into several problems.
Consider the string
strings123[abc789<span>123</span>def<span>456</span>000]strings456
How would I match the numbers inside the square brackets that are not wrapped in span tags, in Python (using the newer regex module)? In the example string, these would be 789 and 000.
I experimented with \G (demo):
(?:\G(?!\A)|\[)
[^\d\]]*
\K
\d+
and with (*SKIP)(*FAIL) (demo):
<span>.*?</span>(*SKIP)(*FAIL)
|
\d+
But I failed to combine the two:
<span>.*?</span>(*SKIP)(*FAIL)
|
(?:
(?:\G(?!\A)|\[)
[^\d\]]*
(\d+)
[^\d\]]*
\K
)
How can I do that?
One of the things I love about the PyPI regex module is that it supports infinite-width lookbehind:
- Variable-length lookbehind
Lookbehind can match a variable length string.
>>> import regex
>>> s = 'strings123[abc789<span>123</span>def<span>456</span>000]strings456'
>>> rx = r'(?<=\[[^][]*)(?:<span>[^<]*</span>(*SKIP)(?!)|\d+)(?=[^][]*])'
>>> regex.findall(rx, s)
['789', '000']
>>>
Pattern details:
- (?<=\[[^][]*) - there must be a [ followed by zero or more characters other than ] and [ immediately to the left of the current location
- (?: - start of a non-capturing group
  - <span>[^<]*</span>(*SKIP)(?!) - match <span>, then zero or more characters other than < (the [^<]* negated character class), then </span>; then discard the match, stay at its end position, and keep looking for the next match from there
  - | - or
  - \d+ - one or more digits
- ) - end of the non-capturing group
- (?=[^][]*]) - there must be a ] after zero or more characters other than ] and [ immediately to the right of the current location.
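For readability, the same pattern can also be written in verbose mode, with one component per line. This is only a sketch: it assumes the third-party regex module is installed (pip install regex), and the comments simply restate the breakdown above.

```python
import regex  # third-party module (pip install regex), not the stdlib re

s = 'strings123[abc789<span>123</span>def<span>456</span>000]strings456'

# Same pattern as above, laid out with regex.VERBOSE so each part is commented.
rx = regex.compile(r"""
    (?<=\[[^][]*)            # an opening bracket to the left, no brackets in between
    (?:
        <span>[^<]*</span>   # a whole span element ...
        (*SKIP)(?!)          # ... is skipped and this alternative is failed
      |
        \d+                  # otherwise, one or more digits
    )
    (?=[^][]*])              # a closing bracket to the right, no brackets in between
""", regex.VERBOSE)

print(rx.findall(s))  # ['789', '000']
```

The variable-width lookbehind and the (*SKIP) verb are features of the regex module; the standard library re module rejects this pattern.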
I was thinking about an algorithm along these lines:
- Find the square brackets and the content inside them, and store the result in a variable. The regex is \[[^]]*\].
- Now find the <span> elements and replace each one with a hyphen (-), just to keep the next step simple. The regex is (<span>.*?</span>).
- What remains is the content of the square brackets without the <span> elements. Just search for \d+ to match the numbers.
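The three steps above can be sketched with the standard library re module alone. The function name numbers_outside_spans is hypothetical and chosen only for illustration; the sketch assumes span elements are not nested and brackets are balanced.

```python
import re

def numbers_outside_spans(text):
    """Return digit runs inside [...] that are not wrapped in <span> tags."""
    results = []
    # Step 1: grab each bracketed chunk, brackets included.
    for chunk in re.findall(r"\[[^]]*\]", text):
        # Step 2: replace every <span>...</span> element with a hyphen.
        stripped = re.sub(r"<span>.*?</span>", "-", chunk)
        # Step 3: any digits that remain were not inside a span.
        results.extend(re.findall(r"\d+", stripped))
    return results

s = "strings123[abc789<span>123</span>def<span>456</span>000]strings456"
print(numbers_outside_spans(s))  # ['789', '000']
```

Replacing each span with a hyphen rather than an empty string keeps digits on either side of a span from merging into one number: "[1<span>2</span>3]" yields 1 and 3, not 13.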