Highlight whole words, omit HTML
I am writing C# code to parse RSS feeds and highlight specific whole words in the content. However, I only want to highlight words that are outside the HTML tags. So far I have:
string contentToReplace = "This is <a href=\"test.aspx\" alt=\"This is test content\">test</a> content";
string pattern = @"\b(this|the|test|content)\b";
string output = Regex.Replace(contentToReplace, pattern, "<span style=\"background:yellow;\">$1</span>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
This works great, except that the word "test" is also highlighted inside the alt attribute. I could easily write a function that strips out the HTML and then does the replacement, but I need to keep the HTML to render the content.
If the input is valid XHTML/XML, you can parse it into a tree structure (DOM or LINQ to XML), traverse the tree recursively, replace all occurrences of the keywords in text nodes only, and finally serialize the tree back to a string.
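Untested sketch using LINQ to XML:
// Requires System.Collections.Generic, System.Linq,
// System.Text.RegularExpressions and System.Xml.Linq.
XElement Highlight(XElement element, List<string> keywords)
{
    // One whole-word, case-insensitive pattern built from the escaped keywords.
    var regex = new Regex(
        @"\b(" + string.Join("|", keywords.Select(Regex.Escape)) + @")\b",
        RegexOptions.IgnoreCase);
    // Copy the element name and attributes; attribute values are never searched.
    var result = new XElement(element.Name, element.Attributes());
    foreach (var node in element.Nodes())
    {
        if (node is XText text)
        {
            var value = text.Value;
            var position = 0;
            foreach (Match match in regex.Matches(value))
            {
                // Add the text before the keyword, then the highlighted keyword.
                if (match.Index > position)
                    result.Add(new XText(value.Substring(position, match.Index - position)));
                result.Add(new XElement(element.Name.Namespace + "span",
                    new XAttribute("style", "background:yellow;"), match.Value));
                position = match.Index + match.Length;
            }
            // Add whatever text remains after the last keyword.
            if (position < value.Length)
                result.Add(new XText(value.Substring(position)));
        }
        else if (node is XElement child)
        {
            result.Add(Highlight(child, keywords));
        }
        else
        {
            result.Add(node); // comments, processing instructions, etc. are copied as-is
        }
    }
    return result;
}
// The input must have a single root element; wrap a fragment in one if needed.
var output = Highlight(XElement.Parse(input, LoadOptions.PreserveWhitespace), new List<string> {...}).ToString(SaveOptions.DisableFormatting);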
Another solution, if you have valid XML but don't want to parse it: first split the input string into parts so that each part contains either a tag or text, but not both. For example:
"This is ",
"<a href=\"test.aspx\" alt=\"This is test content\">",
"test",
"</a>",
" content"
Then iterate over the parts and apply the regex only to those that do not start with '<'. Finally, concatenate all the parts back into one string.
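The split-and-replace steps above could be sketched roughly as follows. This is only an illustration, not a tested implementation: the class name TagAwareHighlighter and its Highlight method are made up for the example, and the tag pattern <[^>]*> assumes that no attribute value contains a literal '>' and that the input has no comments, CDATA sections, or script blocks.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagAwareHighlighter
{
    // Hypothetical helper: highlights whole-word keywords in text outside of tags.
    public static string Highlight(string html, params string[] keywords)
    {
        var keywordPattern =
            @"\b(" + string.Join("|", keywords.Select(Regex.Escape)) + @")\b";

        // Because the tag pattern is wrapped in a capturing group,
        // Regex.Split keeps the tags as separate entries in the result.
        var parts = Regex.Split(html, @"(<[^>]*>)");

        for (var i = 0; i < parts.Length; i++)
        {
            // Leave tags untouched; only text parts are searched.
            if (parts[i].StartsWith("<", StringComparison.Ordinal))
                continue;

            parts[i] = Regex.Replace(
                parts[i],
                keywordPattern,
                "<span style=\"background:yellow;\">$1</span>",
                RegexOptions.IgnoreCase);
        }

        // Stitch tags and (possibly highlighted) text back together.
        return string.Concat(parts);
    }

    public static void Main()
    {
        var input = "This is <a href=\"test.aspx\" alt=\"This is test content\">test</a> content";
        Console.WriteLine(Highlight(input, "this", "the", "test", "content"));
    }
}
```

For the sample string from the question, this should highlight "This", the link text "test", and the trailing "content", while leaving the href and alt attributes unchanged.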
Here's a basic one that only matches the keyword when it sits between a '>' and a '<':
private void Form1_Load(object sender, EventArgs e)
{
    string contentToReplace = "This is <a href=\"test.aspx\" alt=\"This is test content\"> hello test world</a> content";
    // Group 1: a '>' and the text before the keyword; group 2: the keyword;
    // group 3: the text after it up to a '<'.
    // Note: because .* is greedy, at most one occurrence is replaced per call.
    string pattern = @"(>.*)(test)(.*<)";
    string output = Regex.Replace(contentToReplace, pattern, "$1<span>$2</span>$3", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // output is:
    // This is <a href="test.aspx" alt="This is test content"> hello <span>test</span> world</a> content
    MessageBox.Show(output);
    Close();
}
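The pattern above replaces only a single occurrence and ignores text before the first tag or after the last one. A possible variation, which is a different technique from the one in this answer, keeps the whole-word pattern from the question and adds a negative lookahead that rejects any match followed by a '>' with no '<' in between (that is, a match sitting inside a tag). The class name LookaheadHighlighter is made up for the sketch, and it assumes attribute values never contain a literal '<' or '>'.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadHighlighter
{
    // Whole-word keywords, not followed by "...>" without an intervening '<' or '>'.
    // If such a '>' follows, the match position is inside a tag and is skipped.
    private const string Pattern = @"\b(this|the|test|content)\b(?![^<>]*>)";

    // Hypothetical helper: wraps every keyword outside of tags in a highlight span.
    public static string Highlight(string html)
    {
        return Regex.Replace(
            html,
            Pattern,
            "<span style=\"background:yellow;\">$1</span>",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var input = "This is <a href=\"test.aspx\" alt=\"This is test content\">test</a> content";
        Console.WriteLine(Highlight(input));
    }
}
```

With the question's sample string, this should highlight "This", the link text "test", and the final "content", and leave the keywords inside the href and alt attributes alone.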