Regular expression with strange behavior
I've been trying to resolve this for the past 2 days ...
Please help me understand why this is happening. My intention is simply to select the <HDR> with <DTL1 val="92">.....</HDR>
This is my regex
(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>
And the input line:
<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>
But this regex selects
abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>
Can anyone help me?
The regex engine always returns the leftmost match in the string (even if you use a non-greedy quantifier). This is exactly what you get.
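To see the leftmost-match rule in action, here is a minimal sketch that runs the regex from the question against the input line. Python's re module is only my choice for illustration (the question does not say which engine is used), and the variable names are made up.

```python
import re

# Input line and regex copied from the question.
text = ('<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
        '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
        '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>')
pattern = re.compile(r'(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>')

match = pattern.search(text)

# The first position preceded by <HDR> is index 5 (the "a" of "abc").
# The lazy .*? keeps expanding from there until <DTL1 val="3" is found,
# so the match runs across the earlier </HDR> boundaries.
print(match.start())
print(match.group().startswith('abc'))
```

The lazy quantifier only makes the match as short as possible from a given starting point; it does not move the starting point to the right.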
So the solution is to forbid another <HDR> from appearing in the parts described by .*?, which is too permissive.
You have two techniques for this. You can replace .*? with:
(?>[^<]+|<(?!/HDR))*
or with:
(?:(?!</HDR).)*+
In most cases the first technique is more efficient, but if your string contains a high density of < characters, the second technique can also give good results.
Using a possessive quantifier or an atomic group can reduce the number of steps needed to get the result, particularly when a subpattern fails.
Example:
With the first technique:
(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>
or this option:
(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>
With the second technique:
(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
or this option:
(?<=<HDR>)(?:(?!</HDR|<DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
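As a quick check of the second technique, here is a sketch in Python (again only my choice of engine). Python's re module supports atomic groups and possessive quantifiers only from version 3.11, so this sketch uses ordinary greedy quantifiers; the lookahead still keeps each repetition inside a single block. The helper name inside is hypothetical.

```python
import re

text = ('<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
        '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
        '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>')

# Each character is consumed only if "</HDR" does not start at that position,
# so neither repetition can run past the end of the current block.
inside = r'(?:(?!</HDR).)*'
pattern = re.compile(r'(?<=<HDR>)' + inside + r'<DTL1\sval="3"' + inside + r'</HDR>')

match = pattern.search(text)
print(match.group())
```

The attempts that start after the first two <HDR> tags fail because val="3" is not found before their closing tag, so the engine moves on to the third block.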
Casimir et Hippolyte has already given you a couple of good solutions. I want to elaborate on a few things.
First, why your regex doesn't do what you want: (?<=<HDR>).*? tells the engine to match any number of characters, starting at the first character that is preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that is preceded by <HDR> is the first a, so it matches everything from there until the fixed string <DTL1\sval="3" is encountered.
Casimir et Hippolyte's solutions are intended for the general case, where the content of the <HDR> tags can be anything other than nested <HDR> tags. You can also do this with a negative lookahead:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
However, if the string is guaranteed to have the structure shown, where the <HDR> tags contain only one or more <DTL1 val="##"> tags, so you know there won't be any closing tags inside them, you can do it more efficiently by replacing the first .*? with [^/]*:
(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>
A negated character class is more efficient than a zero-width assertion, and if you use a negated character class, the greedy quantifier becomes more efficient than the lazy one.
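Both patterns from this answer can be checked against the sample line. The sketch below assumes Python's re module (the question does not name an engine) and only confirms that the two variants select the same block; it does not measure their relative speed.

```python
import re

text = ('<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
        '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
        '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>')

# Variant with a negative lookahead after each consumed character.
lookahead_version = re.compile(r'(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>')
# Variant with a negated character class; relies on "/" appearing
# only in closing tags, as in the sample line.
char_class_version = re.compile(r'(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>')

results = [p.search(text).group() for p in (lookahead_version, char_class_version)]
print(results)
```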
Note also that by using a lookbehind to match the opening <HDR>, you exclude it from the match, but you include the closing </HDR>. Are you sure that is what you want? You are matching this ...
<DTL1 val="3"><DTL2 val="4"></HDR>
... when you presumably want this ...
<HDR><DTL1 val="3"><DTL2 val="4"></HDR>
... or this ...
<DTL1 val="3"><DTL2 val="4">
So, in the first case, don't use a lookbehind for the opening tag:
<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>
In the second case, use a lookahead for the closing tag:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)
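The effect of the lookbehind and lookahead on what ends up in the match can be compared side by side. This is a sketch using Python's re module (my assumption, not something stated in the question) with the [^/]* form of each pattern; the dictionary keys are just labels I made up.

```python
import re

text = ('<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
        '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
        '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>')

variants = {
    # Opening tag excluded by the lookbehind, closing tag consumed.
    'lookbehind only': r'(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>',
    # Both tags consumed, so both are part of the match.
    'both tags included': r'<HDR>[^/]*<DTL1\sval="3".*?</HDR>',
    # Both tags only asserted, so neither is part of the match.
    'both tags excluded': r'(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)',
}

matches = {name: re.search(rx, text).group() for name, rx in variants.items()}
for name, matched in matches.items():
    print(name, '->', matched)
```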