How to match a hexadecimal sequence of characters and replace it with a space in PHP
I have some text that needs to be cleared of some characters. These symbols are shown in the photos that I attached to the question. I want to replace them with a space (\x20).
My attempt was to use preg_replace:
$result = preg_replace("/[\xef\x82\xac\x09|\xef\x81\xa1\x09]/", "\x20", $string);
This approach works in some cases but not in others: for example, I had a text with a semicolon where the pattern matched the byte \x82 on its own and removed it from that text.
How can I write my regular expression so that it searches for the complete sequence ef 82 ac 09 or the other one, ef 81 a1 09, and not for each byte separately (ef, 82, ac, 09)?
1.) Your pattern is a character class, so it matches any single one of six different bytes or the pipe character. You probably wanted a non-capturing group (?: ... | ... ) to match alternative byte sequences.
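To illustrate the difference, here is a minimal sketch. The input string is made up for the demonstration: it contains a euro sign (e2 82 ac in UTF-8), which shares two bytes with the first target sequence without containing it.

```php
<?php
// Hypothetical input: "10 €;" followed by the target sequence ef 82 ac 09.
$input = "10 \xe2\x82\xac;\xef\x82\xac\x09end";

// Character class: each listed byte (and "|") is replaced on its own,
// so the bytes 82 and ac inside the euro sign are hit as well.
$byClass = preg_replace("/[\xef\x82\xac\x09|\xef\x81\xa1\x09]/", "\x20", $input);

// Non-capturing group: only a complete four-byte sequence is replaced.
$byGroup = preg_replace("/(?:\xef\x82\xac\x09|\xef\x81\xa1\x09)/", "\x20", $input);

echo bin2hex($byClass), "\n"; // 313020e220203b20202020656e64
echo bin2hex($byGroup), "\n"; // 313020e282ac3b20656e64
```

With the character class the euro sign is left as a broken e2 20 20, while the group keeps e2 82 ac intact and replaces the four-byte sequence with a single space.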
2.) Also, the byte sequences do not match the image. It looks like you swapped two bytes. The picture shows ef 82 a1 09 and ef 81 ac 09, whereas your attempt uses \xef\x82\xac\x09|\xef\x81\xa1\x09.
3.) When testing your sample input
$str = "de la nouvelle; Fourniture $ Option :";
foreach(preg_split("//u", $str) AS $v) {
var_dump($v, bin2hex($v)); echo "\n";
}
it turned out that the 09 is too much. The characters to be removed are actually ef81ac and ef82a1, so the correct regex would be (?:\xef\x81\xac|\xef\x82\xa1):
$result = preg_replace("/(?:\xef\x81\xac|\xef\x82\xa1)/", "\x20", $string);
See test at eval.in
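As a quick check of the corrected pattern, here is a small sketch. The sample string is an assumption: it places ef 81 ac and ef 82 a1 in front of a tab each, which is one way the 09 could have ended up next to them in the hex dump.

```php
<?php
// Hypothetical sample: each unwanted character is followed by a tab (09).
$string = "de la nouvelle;\xef\x81\xac\x09Fourniture\xef\x82\xa1\x09Option :";

$result = preg_replace("/(?:\xef\x81\xac|\xef\x82\xa1)/", "\x20", $string);

// The three-byte sequences become spaces; the tabs are left untouched.
var_dump($result === "de la nouvelle; \x09Fourniture \x09Option :"); // bool(true)

// An empty pattern with the u modifier only matches if the subject is valid UTF-8.
var_dump(preg_match("//u", $result) === 1); // bool(true)
```

Because only complete three-byte sequences are replaced, the result is still valid UTF-8.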
If the entire content is UTF-8 encoded text, you may want to remove all characters from the Private Use Area, because \xef\x82\xac decodes to U+F0AC and \xef\x81\xa1 decodes to U+F061, both of which belong to the Private Use Area U+E000..U+F8FF.
$result = preg_replace("~\p{Co}~u", " ", $input);
\p{Co} is the character class of all characters belonging to the Other, Private Use category in Unicode, which includes all characters in the 3 ranges U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD.
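The decoding can be verified by hand. The sketch below uses a hypothetical helper, codePointOfThreeBytes, that only handles three-byte UTF-8 sequences, and a made-up input string that mixes private-use characters with ordinary text.

```php
<?php
// Hypothetical helper: decode one three-byte UTF-8 sequence to its code point.
function codePointOfThreeBytes(string $bytes): int
{
    return ((ord($bytes[0]) & 0x0F) << 12)
         | ((ord($bytes[1]) & 0x3F) << 6)
         |  (ord($bytes[2]) & 0x3F);
}

// The two sequences from the question and the two found in the sample input.
foreach (["\xef\x82\xac", "\xef\x81\xa1", "\xef\x81\xac", "\xef\x82\xa1"] as $bytes) {
    printf("%s = U+%04X\n", bin2hex($bytes), codePointOfThreeBytes($bytes));
}
// ef82ac = U+F0AC
// ef81a1 = U+F061
// ef81ac = U+F06C
// ef82a1 = U+F0A1

// Hypothetical input: private-use characters between letters, plus a euro sign.
$input  = "A\xef\x81\xacB\xef\x82\xa1C \xe2\x82\xac";
$result = preg_replace("~\p{Co}~u", " ", $input);
echo $result, "\n"; // A B C €
```

All four sequences fall inside U+E000..U+F8FF, so a single \p{Co} pattern covers them. Note that with the u modifier preg_replace returns null when the input is not valid UTF-8, so this variant only fits text that really is UTF-8.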