How can I get a set of values from nested HTML-like elements using RegExp?

I am having trouble writing a regex for the following task:

Suppose we have HTML-like text like this:

<x>...<y>a</y>...<y>b</y>...</x>

      

I want to get the set of values inside the <y></y> tags located within a given <x> tag, so the above example will result in a set of two elements: ["a", "b"].

In addition, we know that:

  • <y> tags cannot be nested inside other <y> tags.
  • ... can contain any text or other tags.

How can I achieve this using RegExp?



4 answers


This is a job for an HTML/XML parser. You can do this with regular expressions, but that would be very messy. There are examples on the page I linked to.
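As a rough illustration of the parser route, here is a minimal Java sketch using the JDK's built-in DOM parser. It assumes the input is well-formed XML (real-world HTML often is not, and would need an HTML-aware parser instead). The class and method names are hypothetical, chosen only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomYExtractor {
  // Hypothetical helper: returns the text of every <y> found inside any <x>.
  // Assumes well-formed XML and that <x> elements are not nested in each other
  // (nested <x> elements would report their <y> descendants more than once).
  static List<String> yValuesInsideX(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xml)));
      List<String> values = new ArrayList<>();
      NodeList xs = doc.getElementsByTagName("x");
      for (int i = 0; i < xs.getLength(); i++) {
        // Only <y> elements that are descendants of this <x>.
        NodeList ys = ((Element) xs.item(i)).getElementsByTagName("y");
        for (int j = 0; j < ys.getLength(); j++) {
          values.add(ys.item(j).getTextContent());
        }
      }
      return values;
    } catch (Exception e) {
      throw new IllegalStateException("could not parse input as XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(yValuesInsideX("<x>...<y>a</y>...<y>b</y>...</x>")); // prints [a, b]
  }
}
```

The parser takes care of attributes, whitespace and nesting, which is exactly the part that gets messy with regular expressions.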





I'm taking your word for this:

"y" tags cannot be enclosed in other "y" tags

input looks like: <x>...<y>a</y>...<y>b</y>...</x>

      

and I'm assuming that everything else is also not nested and is correctly formatted. (Disclaimer: If it isn't, it's not my fault.)

First, find the content of any X tags with a loop over matches of this:

<x[^>]*>(.*?)</x>

      

Then (in the body of the loop) find any Y tags in match group 1 of the "outer" match above:

<y[^>]*>(.*?)</y>

      



Pseudo-code:

input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re  = "<x[^>]*>(.*?)</x>"
y_re  = "<y[^>]*>(.*?)</y>"

for each x_match in input.match_all(x_re)
  for each y_match in x_match.group(1).value.match_all(y_re)
    print y_match.group(1).value
  next y_match
next x_match

      

Pseudo-output:

a
b
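For reference, the pseudo-code above could be translated to Java roughly as follows. This is only a sketch under the same assumptions (no nested X or Y tags, reasonably well-formed input). The class and method names are made up for the example, and the DOTALL flag is an addition so that "." also matches line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepYExtractor {
  // Same patterns as in the answer; DOTALL is an assumption added here
  // so that tag contents may span several lines.
  private static final Pattern X_RE = Pattern.compile("<x[^>]*>(.*?)</x>", Pattern.DOTALL);
  private static final Pattern Y_RE = Pattern.compile("<y[^>]*>(.*?)</y>", Pattern.DOTALL);

  // Hypothetical helper: outer loop over X elements, inner loop over the Y
  // elements found in group 1 (the content) of each X match.
  static List<String> yValuesInsideX(String input) {
    List<String> values = new ArrayList<>();
    Matcher x = X_RE.matcher(input);
    while (x.find()) {
      Matcher y = Y_RE.matcher(x.group(1));
      while (y.find()) {
        values.add(y.group(1));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(yValuesInsideX("<x>...<y>a</y>...<y>b</y>...</x>")); // prints [a, b]
  }
}
```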

      


Further clarification in the comments showed that any X element can contain an arbitrary number of Y elements. This means that no single regular expression match can capture all of them and retrieve their contents at once, which is why the approach above loops over the matches in two steps.



Shorter and simpler: use XPath :)
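To make that concrete, here is a small Java sketch using the JDK's javax.xml.xpath API. It assumes the input is well-formed XML. The expression //x//y (every y that is a descendant of some x) and the class and method names are illustrative choices, not something given in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathYExtractor {
  // Hypothetical helper: evaluates //x//y and returns the text of each hit.
  // Assumes the input parses as well-formed XML.
  static List<String> yValuesInsideX(String xml) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList ys = (NodeList) xpath.evaluate(
          "//x//y", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
      List<String> values = new ArrayList<>();
      for (int i = 0; i < ys.getLength(); i++) {
        values.add(ys.item(i).getTextContent());
      }
      return values;
    } catch (Exception e) {
      throw new IllegalStateException("could not evaluate XPath on input", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(yValuesInsideX("<x>...<y>a</y>...<y>b</y>...</x>")); // prints [a, b]
  }
}
```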



It would help if we knew what language or tool you are using; regex flavors vary widely in syntax, semantics and capabilities. Here's one way to do it in Java:

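import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
// Match <y>, then look ahead for </x> with no <x> or </x> tag in between.
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
// Prints "a" and "b"; "c" and "d" lie outside <x> and are skipped.
while (m.find())
{
  System.out.println(m.group(1));
}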

      

Once I have matched <y>, I use a lookahead to confirm that there is a </x> somewhere ahead, but no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well formed, this means that the current match position is inside an "x" element.

I've used possessive quantifiers because they make this kind of thing much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy / PowerGrep / EditPad Pro). On the other hand, many languages provide a way to get all the matches at once, but in Java I had to code my own loop to do this.
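As a side note on collecting all matches at once: on Java 9 or later, Matcher.results() returns a stream of match results, so the explicit loop is optional there. The sketch below assumes such a JDK and reuses the same regex. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SingleRegexYExtractor {
  // Same pattern as in the answer: match <y>, then require a </x> ahead
  // with no <x> or </x> tag before it.
  private static final Pattern Y_IN_X =
      Pattern.compile("<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>");

  // Hypothetical helper; requires Java 9+ for Matcher.results().
  static List<String> yValuesInsideX(String input) {
    return Y_IN_X.matcher(input)
        .results()
        .map(r -> r.group(1))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
    System.out.println(yValuesInsideX(str)); // prints [a, b]
  }
}
```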

So the job can be done with a single regex, but it is very complex, and both the regex and the surrounding code must be adapted to the language you are working in.
