Insert a space after a semicolon unless it is part of an HTML entity

I am trying to insert a space after each semicolon, unless the semicolon is part of an HTML entity. The examples here are short, but my lines can be quite long, with several semicolons (or none).

Coca&#8209;Cola => Coca&#8209;Cola (&#8209; is a non-breaking hyphen)
Beverage;Food;Music => Beverage; Food; Music

      

I found the following regex that does the trick for short strings:

<?php
$a[] = 'Coca&#8209;Cola';
$a[] = 'Beverage;Food;Music';
$regexp = '/(?:&#?\w+;|[^;])+/';
foreach ($a as $str) {
    echo ltrim(preg_replace($regexp, ' $0', $str)).'<br>';
}
?>

      

However, if the line is fairly long, the preg_replace call above actually crashes my Apache server (the connection to the server was reset while the page was loading). To reproduce, add the following line to the code above:

$a[] = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '.
   'In blandit metus arcu. Fusce eu orci nulla, in interdum risus. '.
   'Maecenas ut velit turpis, eu pretium libero. Integer molestie '.
   'faucibus magna sagittis posuere. Morbi volutpat luctus turpis, '.
   'in pretium augue pellentesque quis. Cras tempor, sem suscipit '.
   'dapibus lacinia, dolor sapien ultrices est, eget laoreet nibh '.
   'ligula at massa. Cum sociis natoque penatibus et magnis dis '.
   'parturient montes, nascetur ridiculus mus. Phasellus nulla '.
   'dolor, placerat non sem. Proin tempor tempus erat, facilisis '.
   'euismod lectus pharetra vel. Etiam faucibus, lectus a '.
   'scelerisque dignissim, odio turpis commodo massa, vitae '.
   'tincidunt ante sapien non neque. Proin eleifend, lacus et '.
   'luctus pellentesque;odio felis.';

      

The above code (with the long line) crashes Apache, but works if I run PHP from the command line.

Elsewhere in my program, I use preg_replace on much larger strings without issue, so I assume this particular regex is overloading PHP / Apache.

So, is there a way to "fix" the regex to work on Apache with large strings, or is there another safer way to do this?

I am using PHP 5.2.17 with Apache 2.0.64 on Windows XP SP3 if that helps. (Unfortunately, updating PHP or Apache is not an option right now.)



3 answers


I would suggest the following expression:

\b(?<!&)(?<!&#)\w+;

      

... which matches a run of word characters (letters, digits, and underscores) that is not preceded by an ampersand (or by an ampersand followed by a hash character) and is followed by a semicolon.

It breaks down into:

\b          # assert that this is a word boundary
(?<!        # look behind and assert that you cannot match
 &          # an ampersand
)           # end lookbehind
(?<!        # look behind and assert that you cannot match
 &#         # an ampersand followed by a hash symbol
)           # end lookbehind
\w+         # match one or more word characters
;           # match a semicolon

      

Replace each match with the string '$0 '.

Let me know if it doesn't work for you.



You could of course also use [a-zA-Z0-9] instead of \w to avoid matching underscores, but I don't think that would ever cause you any trouble.

In addition, you may need to escape the hash character (it starts a comment when the pattern is compiled with the x modifier), for example:

\b(?<!&)(?<!&\#)\w+;
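
To illustrate why the escape matters, here is a minimal sketch that uses the commented breakdown directly as the pattern via the x modifier. The variable names and the sample array are assumptions for illustration; array() is used so the snippet should also run on older PHP versions such as 5.2.

```php
<?php
// Sketch: the commented breakdown as a real pattern, compiled with the x modifier.
// In x mode whitespace is ignored and an unescaped # starts a comment,
// so the literal hash in the second lookbehind has to be written as \#.
$pattern = '/
    \b          # word boundary
    (?<!&)      # not preceded by an ampersand
    (?<!&\#)    # not preceded by an ampersand plus a hash
    \w+         # one or more word characters
    ;           # a semicolon
/x';

// Sample strings taken from the question.
$samples = array('Coca&#8209;Cola', 'Beverage;Food;Music');

foreach ($samples as $str) {
    // '$0 ' re-emits the whole match followed by one space.
    echo preg_replace($pattern, '$0 ', $str), "\n";
}
// Prints:
// Coca&#8209;Cola
// Beverage; Food; Music
```

Without the x modifier the backslash is harmless, so the escaped form can be used in either mode.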

      

EDIT: I'm not sure, but my guess is that putting the word boundary at the beginning makes the expression a little more efficient (and therefore less likely to crash your server), so I changed that in the expressions and the breakdown above.

EDIT 2: A little more information on why your expression might crash your server: Catastrophic Backtracking. I think it applies here (?); it is good information nonetheless.

FINAL EDIT: If you want to add a space after a semicolon only when there isn't one already (i.e. add one in the case of pellentesque;odio, but not in the case of pellentesque; odio), then add a negative lookahead to the end, which prevents any extra spaces from being added:

\b(?<!&)(?<!&\#)\w+;(?!\s)
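
As a rough sketch of how this final expression might be wired into the question's code, the snippet below wraps it in a small helper. The function name add_space_after_semicolons is hypothetical and not part of the answer. It has not been checked against the Apache 2.0 / PHP 5.2 setup from the question.

```php
<?php
// Hypothetical helper; the name is illustrative only.
function add_space_after_semicolons($text)
{
    // \w+; must not directly follow & or &#, and (?!\s) skips
    // semicolons that are already followed by whitespace.
    return preg_replace('/\b(?<!&)(?<!&\#)\w+;(?!\s)/', '$0 ', $text);
}

$samples = array(
    'Coca&#8209;Cola',      // numeric entity, left alone
    'Beverage;Food;Music',  // plain separators, spaces added
    'pellentesque;odio',    // no space yet, one is added
    'pellentesque; odio',   // already spaced, left alone
);

foreach ($samples as $str) {
    echo add_space_after_semicolons($str), "\n";
}
// Prints:
// Coca&#8209;Cola
// Beverage; Food; Music
// pellentesque; odio
// pellentesque; odio
```

Note that a semicolon at the very end of the string still gets a trailing space, because the lookahead only rules out whitespace; the ltrim in the question's code could be swapped for trim if that matters.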

      



You can use a lookbehind:

preg_replace('/(?<=[^\d]);([^\s])/', '; \1', $text)

      



Not tested, as I don't have a computer at hand, but this or a slight variation should work.
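
A quick way to check this approach is to run it over the question's samples. The sketch below does that and adds two extra strings (assumptions, not from the question) that show where the digit-based lookbehind behaves differently: it only skips a semicolon that follows a digit, so named entities are not protected and plain text ending in a digit is not spaced.

```php
<?php
// Pattern and replacement exactly as given above.
$pattern = '/(?<=[^\d]);([^\s])/';

$samples = array(
    'Coca&#8209;Cola',      // numeric entity: semicolon follows a digit
    'Beverage;Food;Music',  // plain separators
    'Fish&amp;Chips',       // named entity: semicolon follows a letter
    'item1;item2',          // plain text that happens to end in a digit
);

foreach ($samples as $str) {
    echo preg_replace($pattern, '; \1', $str), "\n";
}
// Prints:
// Coca&#8209;Cola
// Beverage; Food; Music
// Fish&amp; Chips
// item1;item2
```

If the input can contain named entities or values ending in digits, one of the entity-aware patterns from the other answers is probably the safer choice.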



A callback can help with this problem.

(&(?:[A-Za-z_:][\w:.-]*|\#(?:[0-9]+|x[0-9a-fA-F]+)))?;

      

Extended

(          # Capture buffer 1
   &                              # Ampersand '&'
   (?: [A-Za-z_:][\w:.-]*         # normal words
     | \#                         # OR, code '#'
       (?: [0-9]+                       # decimal
         | x[0-9a-fA-F]+                # OR, hex 'x'
       )
   )
)?         # End capture buffer 1, optional
;          # Semicolon ';'

      

Testcase http://ideone.com/xYrpg

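<?php

$line = '
  Coca&#8209;Cola
  Beverage;Food;Music
';

// A named callback is used instead of create_function(), which was
// removed in PHP 8; a plain function also works on PHP 5.2.
function space_after_semicolon($matches)
{
    // Group 1 is only present when the semicolon terminates an entity.
    // empty() avoids an undefined-offset notice when it is absent.
    if (!empty($matches[1])) {
        return $matches[0];      // entity: leave untouched
    }
    return $matches[0] . ' ';    // plain semicolon: append a space
}

$line = preg_replace_callback(
    '/(&(?:[A-Za-z_:][\w:.-]*|\#(?:[0-9]+|x[0-9a-fA-F]+)))?;/',
    'space_after_semicolon',
    $line
);
echo $line;
?>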

      
