Using Lookahead to Match a String with a Regular Expression

I need to match HTML strings with a regex to pull out all nested spans. I guess there is a way to do it with a regex, but I haven't managed it all morning.

So, for an example input line:

<DIV id=c445c9c2-a02e-4cec-b254-c134adfa4192 style="BORDER-RIGHT: #000000 1px solid; BORDER-TOP: #000000 1px solid; BORDER-LEFT: #000000 1px solid; BORDER-BOTTOM: #000000 1px solid; BACKGROUND-COLOR: #eeeeee">
<SPAN id=b8db8cd1-f600-448f-be26-2aa56ea09a9c>
<SPAN id=304ccd38-8161-4def-a557-1a048c963df4>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=bc88c866-5370-4c72-990b-06fbe22038d5>
<SPAN id=55b88bbe-15ca-49c9-ad96-cecc6ca7004e>UK<BR></SPAN>
</SPAN>
<SPAN id=52bb62ca-8f0a-42f1-a13b-9b263225ff1d>
<SPAN id=0e1c3eb6-046d-4f07-96c1-d1ac099d5f1c>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=4c29eef2-cd77-4d33-9828-e442685a25cb>
<SPAN id=0d5a266a-14ae-4a89-9263-9e0ab57f7ad2>Italy</SPAN>
</SPAN>
<SPAN id=f0a72eea-fddd-471e-89e6-56e9b9efbece>
<SPAN id=b7d9ada7-ade0-49fe-aa5f-270237e87c2b>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=7604df94-34ba-4c89-bf11-125df01731ff>
<SPAN id=330d6429-4f1b-46a2-a485-9001e2c6b8c1>Netherlands</SPAN>
</SPAN>
<SPAN id=a18fb516-451e-4c32-ab31-3e3be29235f6>
<SPAN id=6c70238d-78f9-468f-bb8d-370fff13c909>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=5a2465eb-b337-4f94-a4f8-6f5001dfbd75>
<SPAN id=47877a9e-a7d5-4f13-a41e-6948f899e385>Malta &amp; Gozo

      

I would like to get each outer span and the span it contains, so there should be eight results in the above text.

Any help gladly accepted.

+1




4 answers


Try the following:

@"(?is)<SPAN\b[^>]*>\s*(<SPAN\b[^>]*>.*?</SPAN>)\s*</SPAN>"

      



This is basically the same as PhiLho's regex, except that it allows whitespace between the tags at both ends. I also had to add the SingleLine / DOTALL modifier to accommodate line separators within the matched text. I don't know if either of these changes was actually necessary; the sample data posted by the OP was all on one line, but PhiLho broke it up (thus breaking his own regex).
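As a rough illustration, here is how the pattern above might be exercised from Java, where the inline `(?is)` flags behave the same way. The class name `NestedSpanRegex`, the method `findPairs`, and the shortened sample markup are made up for this sketch and do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedSpanRegex {
    // Same pattern as above; (?is) enables case-insensitive and DOTALL matching.
    static final Pattern OUTER_SPAN = Pattern.compile(
        "(?is)<SPAN\\b[^>]*>\\s*(<SPAN\\b[^>]*>.*?</SPAN>)\\s*</SPAN>");

    // Returns {outer, inner} for every outer span that directly wraps one inner span.
    static List<String[]> findPairs(String html) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = OUTER_SPAN.matcher(html);
        while (m.find()) {
            pairs.add(new String[] { m.group(0), m.group(1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened sample with line breaks between the tags.
        String html = "<SPAN id=a>\n<SPAN id=b>\n<IMG src=\"menu.gif\">\n</SPAN>\n</SPAN>\n"
                    + "<SPAN id=c>\n<SPAN id=d>UK<BR></SPAN>\n</SPAN>\n";
        for (String[] pair : findPairs(html)) {
            System.out.println("outer: " + pair[0].replace("\n", ""));
            System.out.println("inner: " + pair[1].replace("\n", ""));
        }
    }
}
```

Group 0 is the outer span and group 1 the inner one, so each match yields both pieces the question asks for. This only holds for exactly two levels of nesting with nothing but whitespace around the inner span.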

+1




Use an HTML parser instead and traverse the DOM: regular expressions will never be reliable enough to do this.



+5




This is not actually possible to solve with a standard regex, since standard regexes basically implement type 3 grammars in the Chomsky hierarchy (finite state machines), whereas correctly recognizing arbitrarily nested structures requires at least a type 2 grammar (some kind of stack or recursion).

However, if you limit the maximum possible nesting level, it might be possible, but I still doubt that regular expressions are the right tool.
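To make the stack point concrete, here is a minimal sketch in Java that uses a regex only to find the individual SPAN tags and an explicit stack to pair them up. The names `SpanStackScanner` and `findSpans` are hypothetical, and the sketch assumes no SPAN-like text inside comments or attribute values, which a real HTML parser would handle properly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanStackScanner {
    // The regex only tokenizes: it finds SPAN open and close tags, nothing more.
    static final Pattern SPAN_TAG = Pattern.compile("(?i)<(/?)SPAN\\b[^>]*>");

    // Returns every balanced <SPAN>...</SPAN> element at any depth, innermost first.
    static List<String> findSpans(String html) {
        List<String> spans = new ArrayList<>();
        Deque<Integer> openOffsets = new ArrayDeque<>(); // start offsets of unclosed spans
        Matcher m = SPAN_TAG.matcher(html);
        while (m.find()) {
            if (m.group(1).isEmpty()) {
                openOffsets.push(m.start());             // opening tag: remember where it starts
            } else if (!openOffsets.isEmpty()) {
                // closing tag: pair it with the most recent unclosed opening tag
                spans.add(html.substring(openOffsets.pop(), m.end()));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Hypothetical sample with three levels of nesting.
        String html = "<SPAN id=a><SPAN id=b><SPAN id=c>x</SPAN></SPAN></SPAN>";
        for (String span : findSpans(html)) {
            System.out.println(span);
        }
    }
}
```

The stack is what the plain regex lacks: it remembers how many spans are still open, so the scanner works for any nesting depth rather than a fixed one.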

+4




Basically, I agree with the advice above: using regular expressions to parse HTML is asking for code that will one day break on odd but legal HTML constructs (not to mention the invalid HTML that browsers accept...). Finding and using a good HTML parser can be helpful in many ways.

That said, I'm pragmatic (and I can't resist a little regex challenge...), and I sometimes use REs on computer-generated HTML (often the output of an export function) because I know the structure I see will not change, as opposed to hand-crafted pages where the author might make typos. This is mainly for quick hacks that I can adapt if the output ever changes.

In your case the HTML is pretty regular, linear and predictable, so the RE is pretty straightforward. I'm giving Java code because I don't know C#, but the adaptation should be trivial.

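import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumes each outer/inner span pair sits on one line, with nothing between the two closing tags
Pattern p = Pattern.compile("(<SPAN id.*?<SPAN id.*?</SPAN></SPAN>)");
Matcher m = p.matcher(html); // html holds the input markup
while (m.find())
{
  System.out.println(m.group(1));
}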

      

HTH.

0

