Problem with data matching outside of html tags

Question

I'm trying to find a way to match content that isn't inside any XML or HTML tags. I've read that using regex is fundamentally bad for parsing XML/HTML, and I'm open to any solution that solves my problem, but if regex works, that's fine too.

Here's an example of what I'm looking for:

the lazy fox jumped <span>over</span> the brown fence.

I want to get back

the lazy fox jumped  the brown fence

Any ideas?

+2

regex

Joseph 11 Sep 09 at 19:26

2 answers

Run this substitution on each line:

s/\(.*\)<.*>.*<.*>\(.*\)/\1\2/

You may have to change some details depending on the implementation (for example, escaping the parentheses may not be necessary), but this will give you exactly what you want (double space in the middle and all).
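For anyone not using sed, here is a minimal Python sketch of the same substitution. It assumes one element per line, as in the example; the names `PATTERN`, `strip_element`, and `strip_elements_per_line` are made up for illustration.

```python
import re

# Python spelling of the sed pattern: a greedy prefix, an opening tag, the
# element's content, a closing tag, then the rest of the line.
PATTERN = re.compile(r"(.*)<.*>.*<.*>(.*)")


def strip_element(line: str) -> str:
    """Drop one tagged element from a single line, keeping the text around it."""
    # count=1 mirrors sed's s/// without the g flag.
    return PATTERN.sub(r"\1\2", line, count=1)


def strip_elements_per_line(text: str) -> str:
    """Apply the substitution to every line, the way sed processes its input."""
    return "\n".join(strip_element(line) for line in text.splitlines())


if __name__ == "__main__":
    print(strip_element("the lazy fox jumped <span>over</span> the brown fence."))
```

Because the leading group is greedy, a line containing several elements loses only the last one per pass, so this is best suited to input shaped like the example.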

+2

G gordon worley iii 11 Sep 09 at 19:41

Jim dagg · Accepted Answer · 2009-09-11T19:37:13+0000

This is probably a naive technique, but my first instinct would be to run a regex, find the text it matches in your parent string, and delete that text from the string, returning the remainder. In C#-style pseudocode:

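using System.Text.RegularExpressions;

string input = "whatever";
// Find every <tag>...</tag> span; [^>]* keeps each tag from swallowing later ones.
MatchCollection matches = Regex.Matches(input, "<[^>]*>.*?</[^>]*>");
foreach (Match m in matches)
{
    // String.Remove takes an index, so use Replace to delete the matched text.
    input = input.Replace(m.Value, "");
}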
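The same idea can be sketched in Python, where a single substitution does the find-and-delete in one step. This is only a sketch: the names `TAGGED` and `remove_tagged` are invented here, and it assumes simple, non-nested elements that open and close on the same line.

```python
import re

# An opening tag, the shortest possible content, then a closing tag.
# [^>]* stops each tag at its own '>' so neighbouring elements stay separate.
TAGGED = re.compile(r"<[^>]*>.*?</[^>]*>")


def remove_tagged(text: str) -> str:
    """Delete every <tag>...</tag> span and return what is left."""
    return TAGGED.sub("", text)


if __name__ == "__main__":
    print(remove_tagged("the lazy fox jumped <span>over</span> the brown fence."))
```

Nested elements, self-closing tags, and elements that span several lines will trip this up, which is where a real HTML/XML parser becomes the safer choice.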
