Stuck on a regex with a zero-width positive lookbehind assertion

I have a string and I would like to find all greater-than characters (">") that are not part of an HTML tag.

Ignoring CDATA, etc., this should be easy: find any ">" that either has no "<" before it, or has another ">" between it and the nearest preceding "<".

Here's my first attempt:

 (?<=(^|>)[^<]*)>

      

I think it should look for any ">" where there are no "<" characters to the left of it, either back to the beginning of the line or back to the previous ">".

I also tried to express it negatively:

 (?<!<[^>]*)>

      

I.e., a ">" that is not preceded by a "<" followed only by non-">" characters.

I suspect I'm just getting twisted around in my head about how lookbehind works.

Unit tests:

 No match in: <foo>
 No match in: <foo bar>
 Match in: <foo> bar>
 Match in: foo> bar
 Match in: >foo
 Two matches in: foo>>
 Two matches in: <foo> >bar>

      

Use case: I am scrubbing HTML from a form field in a wiki that accepts some HTML tags, but users are not very good at HTML and sometimes enter unescaped ">" and "<" literals for actual less-than or greater-than values. I intend to replace those with HTML entities, but only if they are not part of an HTML tag. I know that text like "Height is <10 and >5" could break this, but that is an edge case I can work around or live with.



2 answers


This is much more complicated than it seems at first (as you're discovering). It's much easier to approach it from a different direction: use one regex to match either an HTML tag OR a lone angle bracket. If what you found is a tag, you plug it back in unchanged; otherwise, you convert it to an entity. The Regex.Replace overload that takes a MatchEvaluator parameter is good for this:

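// Requires: using System.Text.RegularExpressions;

static string ScrubInput(string input)
{
  // The tag alternative comes first, so a bracket is only matched
  // on its own when no tag matches at that position.
  return Regex.Replace(input, @"</?\w+>|[<>]", GetReplacement);
}

// Called once per match: escapes lone brackets, leaves tags untouched.
static string GetReplacement(Match m)
{
  switch (m.Value)
  {
    case "<":
      return "&lt;";
    case ">":
      return "&gt;";
    default:
      return m.Value;
  }
}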

      



You will notice that my tag regex, </?\w+>, is more restrictive than yours. I don't know if it is the right one for your needs, but I would advise against using <[^<>]+>: it would find a match in something like "if (x<3||x>9)".
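To make the difference between the two tag patterns concrete, here is a minimal sketch that runs the same tag-or-bracket replacement with each of them. The class name TagPatternDemo, the Scrub helper and the two constant names are made up for illustration; only Regex.Replace with a match evaluator comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // The two tag patterns being compared (names are illustrative).
    public const string StrictTag = @"</?\w+>";
    public const string LooseTag = @"<[^<>]+>";

    // Escapes every "<" or ">" that is not consumed by tagPattern.
    public static string Scrub(string input, string tagPattern)
    {
        // Tag alternative first; a lone bracket is only matched
        // when no tag matches at that position.
        return Regex.Replace(input, tagPattern + "|[<>]", m =>
            m.Value == "<" ? "&lt;" :
            m.Value == ">" ? "&gt;" :
            m.Value);
    }
}
```

With the strict pattern, Scrub("if (x<3||x>9)", TagPatternDemo.StrictTag) escapes both brackets, whereas the loose pattern treats "<3||x>" as a tag and leaves the text unchanged. The trade-off is that the strict pattern also escapes tags with attributes, so "<foo bar>" from the question's test cases comes back as "&lt;foo bar&gt;".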



Get Expresso, a great tool for working with regular expressions.

To be honest, I don't know if you can write a single regex that does what you need.
Keep in mind that some HTML tags do not need to be closed to be valid HTML, and some are self-closing in XHTML.



e.g. <hr>, <br/>, <p>, <li>, <img> or <img />, etc.

      

You might be better off just keeping a list of valid tags and changing all < and > signs that are not part of a valid tag to &lt; and &gt;.
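One way the whitelist idea could be sketched is below. The tag list is a made-up placeholder (the wiki's real list of accepted tags is not given here), and the class and member names are hypothetical. It reuses the tag-or-bracket approach, but only recognises tags whose names are on the list, optionally followed by attributes or a self-closing slash.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistScrubber
{
    // Hypothetical whitelist; replace with the tags the wiki actually accepts.
    private static readonly string[] AllowedTags = { "b", "i", "p", "br", "hr", "li", "img" };

    // Matches an allowed tag (opening, closing or self-closing), or else a lone bracket.
    private static readonly Regex TagOrBracket = new Regex(
        @"</?(?:" + string.Join("|", AllowedTags) + @")\b[^<>]*>|[<>]",
        RegexOptions.IgnoreCase);

    public static string Scrub(string input)
    {
        // Allowed tags pass through unchanged; every other bracket becomes an entity.
        return TagOrBracket.Replace(input, m =>
            m.Value == "<" ? "&lt;" :
            m.Value == ">" ? "&gt;" :
            m.Value);
    }
}
```

Under these assumptions, WhitelistScrubber.Scrub("<b>bold</b> and 3 < 5 > 2 <script>") keeps the <b> tags and escapes everything else, including the brackets around "script", since that name is not on the list.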
