What regex pattern do I need?

I need a regex (to work in PHP) that replaces American English words in HTML with British English words. So color will be replaced by colour, meters by metres, etc. [I know meter is also a British English word, but in the copy we will be using it will always refer to units of distance, not measuring devices.] The pattern has to work correctly on the following examples (they are slightly contrived, but since I have no control over the actual input, they could occur):

<span style="color:red">This is the color red</span>

      

[should not replace color in HTML tag, but should replace it in sentence]

<p>Color: red</p>

      

[the word should be replaced]

<p>Tony Brammeter lives 2000 meters from his sister</p>

      

[should replace meters where it is a word, but not inside the name]

I know there are edge cases where a replacement wouldn't be helpful (if his name was Tony Meter, for example), but they are rare enough that we can handle them when they come.

0




5 answers


HTML/XML should not be parsed with a regex; it is very difficult to write one that handles every case. But you can use the built-in DOM extension and process your string recursively:



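# Warning: untested code!
function process($node, $replaceRules) {
    foreach ($node->childNodes as $childNode) {
        if ($childNode instanceof DOMText) {
            // Text node: apply every pattern => replacement rule in place.
            $childNode->data = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->data
            );
        } elseif ($childNode->hasChildNodes()) {
            // Element node: recurse. Attributes such as style="color:red"
            // are not child nodes, so they are never touched.
            process($childNode, $replaceRules);
        }
    }
}
// Note: the /i flag also matches "Color", but the replacement is always lower case.
$replaceRules = array(
    '/\bcolor\b/i' => 'colour',
    '/\bmeter(s?)\b/i' => 'metre$1', // also covers the plural "meters"
);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();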

      

+5




I think you need a dictionary and maybe even some kind of grammar analysis for it to work correctly, since you have no control over the input. A pure regex solution won't actually be able to handle such data correctly.



So I would suggest that you first come up with a list of words to replace; it's not just "color" and "meter". Wikipedia has some information on the topic.

+4




You don't want a regex for this. Regular expressions are stateless in nature and you need some kind of specific state to be able to tell the difference between "in the html tag" and "in the data".

You want to use an HTML parser in conjunction with something like str_replace or, even better, a proper grammar dictionary, as Lucero suggests.

+1




The second problem is the easier one: you only want to replace a word when it has word boundaries on both sides (http://www.regular-expressions.info/wordboundaries.html). This makes sure you don't replace the "meter" in Brammeter.
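
A minimal sketch of the word-boundary idea, using the question's third example with the markup stripped. The pattern and the variable names are illustrative assumptions, not something from the question:

```php
<?php
// \b only matches at a word boundary, so the "meter" inside "Brammeter"
// (preceded by the letter "m") is left alone.
$text = 'Tony Brammeter lives 2000 meters from his sister';

// The optional "s" is captured and re-inserted so that the plural survives.
$result = preg_replace('/\bmeter(s?)\b/', 'metre$1', $text);

echo $result, PHP_EOL;
```

This handles the name problem only; run on raw HTML it would still rewrite matches inside tags.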

The first problem is much more complicated. You don't want to replace words inside HTML tags, that is, anything between a < and a > character. So your match has to make sure that the last angle bracket seen was a > (or that there was none at all), never a <. That is either tricky, requiring some combination of lookahead / lookbehind assertions, or simply not possible with regexes.

A scripted state machine implementation would be much better here.
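
For illustration, the "tricky" lookahead variant might look like the sketch below. It rests on the assumption that a > never appears inside an attribute value, a comment, or a script block, so treat it as a demonstration of the idea rather than a robust solution:

```php
<?php
// Negative lookahead: reject a match if, reading forward, we reach a ">"
// before any other angle bracket. That situation means we are inside a tag.
$pattern = '/\bcolor\b(?![^<>]*>)/';

$html  = '<span style="color:red">This is the color red</span>';
$fixed = preg_replace($pattern, 'colour', $html);

echo $fixed, PHP_EOL;
```

The `color` in the `style` attribute is followed by `:red">`, so the lookahead rejects it; the `color` in the sentence is followed by ` red<`, so it is replaced.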
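
A rough sketch of such a state machine follows. The function name `replaceOutsideTags` and the rule format are hypothetical, and the scanner assumes every < opens a tag and the next > closes it, so comments, scripts, and a > inside an attribute value are not handled:

```php
<?php
// Hypothetical helper: applies $rules (regex => replacement) only to the
// text found outside of tags, copying the markup itself through unchanged.
function replaceOutsideTags(string $html, array $rules): string
{
    $out    = '';
    $buffer = '';     // text collected since the last tag ended
    $inTag  = false;  // state: are we currently between "<" and ">"?

    // Rewrites the buffered text and appends it to the output.
    $flush = function () use (&$out, &$buffer, $rules) {
        $out .= preg_replace(array_keys($rules), array_values($rules), $buffer);
        $buffer = '';
    };

    $length = strlen($html);
    for ($i = 0; $i < $length; $i++) {
        $char = $html[$i];
        if (!$inTag && $char === '<') {
            $flush();           // a tag starts: emit the text seen so far
            $inTag = true;
            $out .= $char;
        } elseif ($inTag) {
            $out .= $char;      // inside a tag: copy verbatim
            if ($char === '>') {
                $inTag = false;
            }
        } else {
            $buffer .= $char;   // outside a tag: collect text
        }
    }
    $flush();                   // trailing text after the last tag

    return $out;
}

$rules = [
    '/\bcolor\b/'     => 'colour',
    '/\bmeter(s?)\b/' => 'metre$1',
];
echo replaceOutsideTags('<span style="color:red">This is the color red</span>', $rules), PHP_EOL;
```

Scanning byte by byte is safe for UTF-8 input here because < and > are single ASCII bytes that never occur inside a multi-byte sequence.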

+1




You don't need to explicitly use regex. You can try str_replace, or, if you need it to be case insensitive, str_ireplace.

Example:

$str = "<p>Color: red</p>";
// The search string is matched literally; no delimiters or wildcards are needed.
$new_str = str_ireplace('color', 'colour', $str);

      

You can pass an array with all the words you want to find instead of a string.
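
A small sketch of the array form, with made-up sample strings. Note that str_ireplace has no notion of tags or word boundaries, so under this approach the Brammeter name from the question would be altered as well:

```php
<?php
$search  = ['color', 'meter'];
$replace = ['colour', 'metre'];

// Each search term is replaced by the entry at the same index.
$str     = '<p>The color of the meter</p>';
$new_str = str_ireplace($search, $replace, $str);
echo $new_str, PHP_EOL;

// Caveat: substrings inside longer words are replaced too.
$name = str_ireplace($search, $replace, 'Tony Brammeter');
echo $name, PHP_EOL;
```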

0

