Regular expression with strange behavior

I've been trying to resolve this for the past 2 days ...

Please help me understand why this is happening. My intention is simply to select the <HDR> with <DTL1 val="92">.....</HDR>

This is my regex

(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>

      

And the input line:

<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>

      

But this regex selects

abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>

      

Can anyone help me?



2 answers


The regex engine always gives you the leftmost match in the string, even if you use a non-greedy quantifier. That is exactly what you get here.

So the solution is to prohibit the presence of another <HDR> in the parts described by .*?, which is too permissive.
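
To see the leftmost-match rule at work, here is a minimal Python sketch that runs the question's lazy pattern over the question's input line. The helper name lazy_match is made up for the illustration. Run with val="92", the value named in the question's prose, it reproduces the reported selection; run with val="3", the value in the posted regex, the match still begins at abc and simply extends further.

```python
import re

# Input line from the question.
LINE = (
    '<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
    '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
    '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>'
)

def lazy_match(val, line=LINE):
    """Hypothetical helper: apply the question's lazy pattern for a given val."""
    pattern = r'(?<=<HDR>).*?<DTL1\sval="' + re.escape(val) + r'".*?</HDR>'
    return re.search(pattern, line)

m = lazy_match("92")
print(m.start())    # 5: the first position preceded by <HDR>
print(m.group(0))   # begins with "abc", i.e. inside the first <HDR>

m = lazy_match("3")
print(m.start())    # still 5: the lazy .*? stretches instead of moving the start
```

The lazy quantifier only controls how far the match extends from a given start. It never makes the engine skip a start position that can still lead to a match.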

There are two techniques for this. You can replace .*? with:

(?>[^<]+|<(?!/HDR))*

      

or with:

(?:(?!</HDR).)*+

      

In most cases the first technique is more efficient, but if your string contains a high density of <, the second one can also give good results.

Using a possessive quantifier or an atomic group can reduce the number of steps needed to get the result, particularly when a subpattern fails.

Examples:

With the first technique:

(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>

      

or this option:

(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>

      

With the second technique:

(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>

      

or this option:

(?<=<HDR>)(?:(?!</HDR|<DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
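
The second technique can be tried out in Python. This is only a sketch under a few assumptions: the possessive + is left out, because Python's re module accepts possessive quantifiers and atomic groups only from version 3.11 on, and the helper name find_hdr_body is invented for the example. Without the possessive quantifier the engine may take more steps, but it finds the same match.

```python
import re

# Input line from the question.
LINE = (
    '<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
    '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
    '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>'
)

# Any run of characters that never steps onto "</HDR".
# (Non-possessive form, so it also compiles on Python older than 3.11.)
TEMPERED = r'(?:(?!</HDR).)*'

def find_hdr_body(val, line=LINE):
    """Hypothetical helper: match the <HDR> content holding <DTL1 val="...">."""
    pattern = (
        r'(?<=<HDR>)' + TEMPERED
        + r'<DTL1\sval="' + re.escape(val) + r'"'
        + TEMPERED + r'</HDR>'
    )
    m = re.search(pattern, line)
    return m.group(0) if m else None

print(find_hdr_body("3"))    # content of the third <HDR> only
print(find_hdr_body("92"))   # content of the second <HDR> only
print(find_hdr_body("7"))    # None: no header holds val="7"
```

Because neither part of the pattern can cross a closing tag, a failed attempt in one header forces the engine to restart after the next <HDR> instead of stretching the match.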

      



Casimir et Hippolyte has already given you a couple of good solutions. I want to elaborate on a few things.

First, why your regex doesn't do what you want: (?<=<HDR>).*? tells the engine to match any number of characters, starting at the first character that is preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that is preceded by <HDR> is the first a, so it matches everything from there until the fixed text <DTL1\sval="3" is encountered.

Casimir et Hippolyte's solutions are intended for the general case, where the content of the <HDR> tags can be anything other than nested <HDR> tags. You can also do this with a negative lookahead:

(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>

      

However, if the string is guaranteed to have the structure shown, where the <HDR> tags contain only one or more <DTL1 val="##"> tags, so you know there won't be any closing tags inside, you can do it more efficiently by replacing the first .*? with [^/]*:

(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>

      

A negated character class is more efficient than a zero-width assertion, and once you use a negated character class, the greedy quantifier becomes more efficient than the lazy one.
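
As a quick check of the lookahead and [^/]* patterns, the Python sketch below runs both against the question's input, and then against a made-up line (TRICKY, not from the question) whose header content contains a slash. The names first_match and TRICKY are assumptions for the illustration. It says nothing about speed; it only shows where the structural guarantee matters.

```python
import re

# Input line from the question.
LINE = (
    '<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
    '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
    '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>'
)

LOOKAHEAD = r'(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>'
NEGATED_CLASS = r'(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>'

def first_match(pattern, line):
    """Hypothetical helper: text of the first match, or None."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

# On the question's input both patterns pick the third header's content.
print(first_match(LOOKAHEAD, LINE))
print(first_match(NEGATED_CLASS, LINE))

# Made-up input with a "/" before the wanted tag: [^/]* cannot get past it.
TRICKY = '<HDR>a/b<DTL1 val="3"></HDR>'
print(first_match(LOOKAHEAD, TRICKY))      # still matches
print(first_match(NEGATED_CLASS, TRICKY))  # None
```

So the [^/]* shortcut is only safe when no slash can appear between the opening <HDR> and the tag you are looking for.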

Note also that by using a lookbehind to match the opening <HDR>, you exclude it from the match, but you include the closing </HDR>. Are you sure that is what you want? You are matching this ...

<DTL1 val="3"><DTL2 val="4"></HDR>

      



... when you presumably want this ...

<HDR><DTL1 val="3"><DTL2 val="4"></HDR>

      

... or this ...

<DTL1 val="3"><DTL2 val="4">

      

So, in the first case, don't use a lookbehind for the opening tag:

<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>

      

In the second case, also use a lookahead for the closing tag:

(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)
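
The three boundary choices can be compared side by side. The Python sketch below uses the [^/]* form of each pattern on the question's input; the dictionary labels and the function name match_variants are made up for the example.

```python
import re

# Input line from the question.
LINE = (
    '<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR>'
    '<HDR><DTL1 val="92"><DTL2 val="55"></HDR>'
    '<HDR><DTL1 val="3"><DTL2 val="4"></HDR>'
)

# Labels are illustrative only.
VARIANTS = {
    "both tags": r'<HDR>[^/]*<DTL1\sval="3".*?</HDR>',
    "closing tag only": r'(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>',
    "content only": r'(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)',
}

def match_variants(line=LINE):
    """Hypothetical helper: matched text for each boundary variant."""
    return {name: re.search(p, line).group(0) for name, p in VARIANTS.items()}

for name, text in match_variants().items():
    print(f"{name}: {text}")
```

Lookbehind and lookahead only check their surroundings without consuming them, which is why the tags drop out of the matched text as soon as they are moved into an assertion.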

      
