Regex to replace the registered trademark symbol (®)
I need help with a regex.
I have HTML output and need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag inside title and alt attributes, and obviously I don't need to wrap the symbols that are already superscripted.
The following regex matches text that is not part of the HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
The filtered string should be:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
Thanks a lot for your time !!!
Well, here's an easy way, if you can accept the following limitation:
Registered trademark symbols that have already been processed must have </sup> immediately after the ® (optional whitespace in between is tolerated).
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
Logic:
- we only replace a ® that is not followed by </sup>, and ...
- that is not followed by a > character without an opening < before it (in other words, the ® is not inside a tag)
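To check the two lookahead conditions, here is a small sketch that runs the pattern above on the sample string from the question. The variable names $s and $result are only for illustration.

```php
<?php
// Sample input taken from the question.
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// ®            the symbol itself
// (?! ... )    negative lookahead with two alternatives:
//   \s*</sup>  already wrapped: a closing </sup> follows directly
//   [^<]*>     inside a tag: a > is reached before any <
$result = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $result, "\n";
// → <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```

One consequence of the second condition: a ® in plain text that is followed by a literal > before the next < (for example "a® > b") is also left unwrapped.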
I would use an HTML parser instead of regular expressions, since HTML is not a regular language and will present more edge cases than you can dream of (setting aside the contextual constraints you mentioned above).
You don't say what technology you are using. If you post that, someone can undoubtedly recommend an appropriate parser.
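Assuming PHP (the other answers here use preg_replace and preg_split), one possible parser-based sketch uses the built-in DOM extension. wrapRegisteredMarks() is a made-up helper name, and the <body> wrapper plus the XML encoding hint are assumptions made so that libxml reads the fragment as UTF-8; treat this as a starting point rather than a finished solution.

```php
<?php
// Sketch only: wrapRegisteredMarks() is a hypothetical helper, not part of any library.
function wrapRegisteredMarks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
    // Assumption: the XML encoding hint makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"?><body>' . $html . '</body>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Only text nodes are selected, so alt/title attribute values are never touched.
    // Text that already sits inside a <sup> is skipped.
    $nodes = $xpath->query('//text()[contains(., "®") and not(ancestor::sup)]');

    foreach (iterator_to_array($nodes) as $text) {
        $fragment = $doc->createDocumentFragment();
        // Split the text around each ® and keep the symbol as its own piece.
        $parts = preg_split('/(®)/u', $text->nodeValue, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $part) {
            $fragment->appendChild(
                $part === '®'
                    ? $doc->createElement('sup', '®')
                    : $doc->createTextNode($part)
            );
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapRegisteredMarks('<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>'), "\n";
```

The serializer normalizes the markup, so the result may not be byte-for-byte identical to the input apart from the new tags (for example, the img tag may lose its trailing slash).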
Regex alone is not enough for what you want. First you have to write code that detects whether a piece of content is an attribute value or the text of an element node. Then you have to go through all the text content and apply some kind of replacement method. I'm not sure what it is in PHP, but in JavaScript it looks something like this:
content[i].replace(/\®/g, "<sup>®</sup>");
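For what it's worth, the PHP counterpart of that JavaScript call would be str_replace, which replaces every occurrence of a literal string. A minimal sketch, assuming the text content has already been extracted into a hypothetical $content array:

```php
<?php
// $content is a hypothetical array of text-node strings (tags and attributes already excluded).
$content = ['asd® asdasd. asd', 'asd '];

foreach ($content as $i => $text) {
    // str_replace() replaces all occurrences, like the /g flag in the JavaScript version.
    $content[$i] = str_replace('®', '<sup>®</sup>', $text);
}

print_r($content);
```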
I agree with Brian that regex is not a good way to parse HTML, but if you have to use regex you can try to split the string into tokens and then run a regex for each token.
I use preg_split to split the string on HTML tags as well as on the phrase <sup>®</sup>. This leaves tokens that are either text without an already superscripted ®, or tags. Then, in each text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
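foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags (including an existing <sup>®</sup>)
$token = str_replace('®', '<sup>®</sup>', $token); // wrap every bare ® in this text token
}
unset($token); // break the reference left over from the foreach
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"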
Note that this is a naive approach: if the input is not formatted as expected, it may not be tokenized the way you would like (again, regex is not good for parsing HTML ;))
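To make that caveat concrete, here is a small sketch of one input that trips up the tokenizing approach. wrapByTokens() is a hypothetical wrapper around the same preg_split pattern, introduced only for this illustration.

```php
<?php
// Hypothetical helper: the preg_split approach from above, wrapped in a function.
function wrapByTokens(string $html): string
{
    $regex = '/(<sup>®<\/sup>|<.*?>)/i';
    $tokens = preg_split($regex, $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $i => $token) {
        if ($token[0] === '<') continue; // tags and an exact <sup>®</sup> stay as they are
        $tokens[$i] = str_replace('®', '<sup>®</sup>', $token);
    }
    return implode('', $tokens);
}

// The sample from the question comes out as intended:
echo wrapByTokens('<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>'), "\n";
// → <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>

// An attribute on the existing <sup> no longer matches the first alternative,
// so the ® becomes its own text token and is wrapped a second time:
echo wrapByTokens('<p>x<sup class="r">®</sup></p>'), "\n";
// → <p>x<sup class="r"><sup>®</sup></sup></p>
```

If inputs like the second one can occur, the pattern would need to be loosened, or a parser-based approach is probably the safer choice.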