How to match a hexadecimal sequence of characters and replace it with a space in PHP

I have some text that needs to be cleaned of certain characters. These characters are shown in the images attached to the question. I want to replace them with a space (\x20).

First hexadecimal sequence

Second hexadecimal sequence

My attempt was to use preg_replace:

$result = preg_replace("/[\xef\x82\xac\x09|\xef\x81\xa1\x09]/", "\x20", $string);

      

This approach works in some cases but not in others. For example, one text contained a semicolon, and the pattern matched the byte \x82 there and removed it from that text.

How can I write the regular expression so that it matches the whole sequence ef 82 ac 09 or the whole sequence ef 81 a1 09, rather than each byte (ef, 82, ac, 09) separately?



2 answers


1.) Your character class matches any single one of six different hex bytes or a pipe character. You probably wanted to use a group (?: ... | ... ) to match the different byte sequences as alternatives.

2.) Also, the byte sequences do not match the images. It looks like you mixed up two bytes. The pictures show ef 82 a1 09 and ef 81 ac 09, whereas your attempt uses \xef\x82\xac\x09|\xef\x81\xa1\x09.

3.) When testing your sample input

$str = "de la nouvelle;      Fourniture $         Option :";

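// Split into UTF-8 characters (skipping the empty edge matches) and dump each one with its bytes in hex.
foreach (preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY) as $v) {
  var_dump($v, bin2hex($v)); echo "\n";
}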

      



it turned out that the 09 is too much. The characters to be removed are actually ef81ac and ef82a1, so the correct regex would be (?:\xef\x81\xac|\xef\x82\xa1):

$result = preg_replace("/(?:\xef\x81\xac|\xef\x82\xa1)/", "\x20", $string);

      

See test at eval.in
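To illustrate point 1, here is a minimal sketch of how a character class and a non-capturing group behave on the same input. The sample string is made up for this illustration: it places a euro sign (bytes e2 82 ac) directly before the unwanted sequence ef 82 a1, so that both contain the byte 82.

```php
<?php
// Made-up sample: a euro sign (e2 82 ac) followed by the unwanted ef 82 a1.
$string = "5 \xe2\x82\xac\xef\x82\xa1net";

// Character class: each listed byte (and "|") matches on its own, so the
// 82 and ac bytes inside the euro sign are replaced as well.
$byClass = preg_replace("/[\xef\x81\xac|\xef\x82\xa1]/", "\x20", $string);

// Non-capturing group: only the complete three-byte sequences match.
$byGroup = preg_replace("/(?:\xef\x81\xac|\xef\x82\xa1)/", "\x20", $string);

echo bin2hex($byClass), "\n"; // euro sign is broken: e2 followed by spaces
echo bin2hex($byGroup), "\n"; // euro sign is intact, one space was inserted
```

This is probably the same effect the question describes: a legitimate multi-byte character loses one of its bytes because that byte also appears in the class.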



If the content of the entire file is UTF-8 encoded text, you may want to remove all characters from the Private Use Area, because \xef\x82\xac decodes to U+F0AC and \xef\x81\xa1 decodes to U+F061, both of which belong to the Private Use Area U+E000..U+F8FF.

$result = preg_replace("~\p{Co}~u", " ", $input);
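The code points quoted above can be double-checked by decoding the bytes by hand. The sketch below assumes well-formed three-byte UTF-8 sequences only; utf8ThreeByteToCodePoint() is a helper name invented for this illustration, not a PHP built-in. It also decodes ef81ac and ef82a1, the two sequences identified in the other answer.

```php
<?php
// Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) by hand.
// Hypothetical helper for this sketch; it does not validate its input.
function utf8ThreeByteToCodePoint(string $bytes): int
{
    return ((ord($bytes[0]) & 0x0F) << 12)
         | ((ord($bytes[1]) & 0x3F) << 6)
         |  (ord($bytes[2]) & 0x3F);
}

foreach (["\xef\x82\xac", "\xef\x81\xa1", "\xef\x81\xac", "\xef\x82\xa1"] as $bytes) {
    $cp = utf8ThreeByteToCodePoint($bytes);
    // Each of these four values lies inside U+E000..U+F8FF.
    printf("%s => U+%04X\n", bin2hex($bytes), $cp);
}
```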

      



\p{Co} is the character class of all characters belonging to the "Other, Private Use" category in Unicode, which includes all characters in the three ranges U+E000..U+F8FF, U+F0000..U+FFFFD and U+100000..U+10FFFD.
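As a rough usage sketch, the input below is made up for illustration: it mixes ordinary non-ASCII characters (an accented letter and a euro sign) with the Private Use characters U+F06C and U+F0A1. It assumes PHP's PCRE was built with Unicode property support, which is the usual case. Note that with the u modifier preg_replace() returns null when the subject is not valid UTF-8, so the sketch checks for that.

```php
<?php
// Made-up input: "é" and "€" must survive, while the Private Use characters
// U+F06C (ef 81 ac) and U+F0A1 (ef 82 a1) are replaced by spaces.
$input = "caf\xc3\xa9\xef\x81\xac5 \xe2\x82\xac\xef\x82\xa1fin";

$result = preg_replace("~\p{Co}~u", " ", $input);

// With the u modifier, preg_replace() returns null for invalid UTF-8 input.
if ($result === null) {
    $result = $input; // in this sketch, fall back to the unchanged text
}

echo $result, "\n";
```

Unlike the byte-level pattern, this approach works on whole code points, so it cannot split a multi-byte character, but it only applies when the text really is valid UTF-8.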
