Problem matching data outside of HTML tags

I'm trying to find a way to match content that is not inside any XML or HTML tags. I've read that regex is fundamentally a bad fit for parsing XML/HTML, and I'm open to any solution that solves my problem, but if regex works, that's fine too.

Here's an example of what I'm looking for:

the lazy fox jumped <span>over</span> the brown fence.

      

I want to get back:

the lazy fox jumped  the brown fence

      

Any ideas?

+2




2 answers


This is probably a naive technique, but my first instinct would be to run a regex, find the text it matches in your parent string, delete that text from the string, and return the remainder. In C#-style pseudocode:



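using System.Text.RegularExpressions;

string input = "whatever";
// [^>]* keeps each tag match from running past its own closing '>'
MatchCollection matches = Regex.Matches(input, "<[^>]*>.*?</[^>]*>");
foreach (Match m in matches)
{
    // string.Remove takes indices, so delete the matched text with Replace
    input = input.Replace(m.Value, "");
}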
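For comparison, here is a minimal Python sketch of the same match-and-delete idea. The names `TAGGED` and `strip_tagged` are made up for illustration. The pattern assumes simple, well-formed markup: each element has an explicit closing tag, and no element is nested inside another element of the same name. It is not a general HTML parser.

```python
import re

# Opening tag, its content, and the closing tag with the same name.
# Assumption: no element is nested inside an element of the same name.
TAGGED = re.compile(r"<(\w+)\b[^>]*>.*?</\1\s*>", re.DOTALL)

def strip_tagged(text: str) -> str:
    """Return text with every tagged element (tags and content) removed."""
    return TAGGED.sub("", text)

result = strip_tagged("the lazy fox jumped <span>over</span> the brown fence.")
print(result)
```

Removing the element leaves the spaces on both sides of it in place, so the result has a double space where the element used to be.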

      

+1




Run this line by line:

s/\(.*\)<.*>.*<.*>\(.*\)/\1\2/


You may have to change some details depending on the implementation (for example, escaping the parentheses may not be necessary), but for the example above this gives exactly what you want (double space in the middle and all).
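If no sed-style tool is at hand, the substitution can be approximated in Python. This is a rough sketch, not a drop-in equivalent: `PATTERN` and `strip_line` are hypothetical names, and Python's `re` syntax needs no backslashes on the grouping parentheses.

```python
import re

# Text before the tagged part, the tagged part itself, text after it.
# Only the two captured groups are kept in the replacement.
PATTERN = re.compile(r"(.*)<.*>.*<.*>(.*)")

def strip_line(line: str) -> str:
    """Keep the text before and after the tagged part of a single line."""
    return PATTERN.sub(r"\1\2", line)

lines = [
    "the lazy fox jumped <span>over</span> the brown fence.",
    "a line without tags",
]
cleaned = [strip_line(line) for line in lines]
print(cleaned)
```

Because the first group is greedy, a line holding several tagged elements loses only the last one per pass, so such lines would need the substitution applied repeatedly.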

+2

