Replace HTML tag content with Regex

I want to encrypt the text content of an HTML document without changing its layout. The content is stored in tag pairs, for example: <span style ...> text_to_get </SPAN>. My idea is to use Regex to (1) extract each text part and (2) replace it with its ciphertext. Step (1) works, but I am having trouble with step (2). Here is the code I am working on:

private string encryptSpanContent(string text, string passPhrase, string salt, string hash, int iteration, string initialVector, int keySize)
{
    string resultText = text;
    string pattern = "<span style=(?<style>.*?)>(?<content>.*?)</span>";
    Regex regex = new Regex(pattern);
    MatchCollection matches = regex.Matches(resultText);
    foreach (Match match in matches)
    {
        string replaceWith = "<span style=" + match.Groups["style"] + ">" + AESEncryption.Encrypt(match.Groups["content"].Value, passPhrase, salt, hash, iteration, initialVector, keySize) + "</span>";
        resultText = regex.Replace(resultText, replaceWith);
    }
    return resultText;
}

      

Is this the line that is wrong? It seems to replace every matched text with the last replaceWith value:

            resultText = regex.Replace(resultText, replaceWith);

      

Can anyone help me fix this?

+3




2 answers


Consider using the HTML Agility Pack if you are going to work with HTML, because regular expressions can run into problems, especially with nested tags or malformed HTML.

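Assuming your HTML is well formed and you decide to use a regular expression, use the Regex.Replace overload that accepts a MatchEvaluator. The evaluator is called once for each match and returns the replacement for that match, so every occurrence gets its own ciphertext.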

Try this approach:



string input = @"<div><span style=""color: #000;"">hello, world!</span></div>";
string pattern = @"(?<=<span style=""[^""]+"">)(?<content>.+?)(?=</span>)";
string result = Regex.Replace(input, pattern,
    m => AESEncryption.Encrypt(m.Groups["content"].Value, passPhrase, salt, hash, iteration, initialVector, keySize));

      

Here I am using a lambda expression for the MatchEvaluator and referring to the "content" group, as shown above. I also use look-arounds for the span tags so that they are not part of the match and are therefore left untouched by the replacement.
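To make the per-match behaviour easier to try out, here is a minimal, self-contained sketch of the same idea. The class SpanContentReplacer and the method FakeEncrypt are hypothetical names introduced only for illustration. FakeEncrypt merely Base64-encodes the text and stands in for the asker's AESEncryption.Encrypt, which is not shown in the thread. The IgnoreCase and Singleline options are an added assumption, chosen because the question's example closes with </SPAN> and span text may contain line breaks.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SpanContentReplacer
{
    // The look-behind and look-ahead keep <span style="..."> and </span>
    // out of the match, so only the text between them is replaced.
    private static readonly Regex SpanContent = new Regex(
        @"(?<=<span style=""[^""]+"">)(?<content>.+?)(?=</span>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // transform is invoked once per match and returns that match's replacement.
    public static string ReplaceSpanContent(string html, Func<string, string> transform)
    {
        return SpanContent.Replace(html, m => transform(m.Groups["content"].Value));
    }

    // Hypothetical stand-in for AESEncryption.Encrypt.
    // Base64 is an encoding, NOT encryption; swap in the real call.
    public static string FakeEncrypt(string plain)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(plain));
    }
}
```

Calling SpanContentReplacer.ReplaceSpanContent(html, SpanContentReplacer.FakeEncrypt) transforms each span's text independently and leaves the tags and their attributes as they were. With the real cipher, the lambda would instead call AESEncryption.Encrypt with the pass phrase, salt, and other parameters from the question.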

+3




Here is a simple solution that replaces HTML tags (here with an empty string, which removes them):



string ReplaceBreaks(string value)
{
    return Regex.Replace(value, @"<(.|\n)*?>", string.Empty);
}

      

-2








