Fighting special characters (html_entity_decode, iconv, etc.)

Question

I am struggling to get a bunch of characters converted to basic UTF-8 so I can store them in my database.

PHP's iconv fails on many symbols, so I had to build my own "solution", which is not really a solution since it does not fully work either. iconv also does not work properly on Windows, so developing with it is mostly useless: I have to "dev" on the test server. And since iconv skips a ton of characters, it is not much use anyway.

Here are the functions I have so far:

function replace_accents($string) {
    return str_replace(
        array('à', 'á', 'â', 'ã', 'ä', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó',
              'ô', 'õ', 'ö', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Ç', 'È', 'É',
              'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ù', 'Ú', 'Û', 'Ü', 'Ý'),
        array('a', 'a', 'a', 'a', 'a', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o',
              'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'A', 'A', 'A', 'A', 'C', 'E', 'E',
              'E', 'E', 'I', 'I', 'I', 'I', 'N', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y'),
        $string
    );
}


function replaceQuote($string) {
    $replaceQuote = array('‘', '’', '“', '”', '‚', '„', "'");
    return str_replace($replaceQuote, '\'', $string);
}

function replaceArray($string) {
    $replaceArray = array('-', '™', '&TRADE;', '™', '©', '®', '®', '©',
        '¡', '¡', '¢', '¢', '£', '£', '¤', '¥', '¥', '¦', '§', '§', '"', '"', '¬', '¬', '',
        '¯', '¯', '²', '³', 'µ', 'µ', '¶', '¶', '·', '·', '¸', '¸', '¹', 'º', 'º',
        '"', '‹', '"', '¼', '½', '¾', '♥', '☆', '☠', '░', '▒', '▓', '█', '★',
        '♪', '♫', '◄', '▀', '', '►', '¤', '^', '☣', '…', '†', '‡', '.:', '♣', 'Ξ', 'ξ',
        '&Rarr;', '⇒', '→', '&Larr;', '⇐', '←',
        '⇔', '↔', '™', '♠', '&loz', '√', '∩', '&Cap', '∴');
    return str_replace($replaceArray, '', $string);
}

function special_replace($string) {
    $replace_from = array('ƒ', 'Œ', 'œ', '•', '-', '-', '˜', 'š', 'Š', 'Ÿ', 'ÿ', 'ε',
        '€', 'α', 'Α', 'τ', 'Τ', 'θ', 'Θ');

    $replace_to = array('ƒ', 'Œ', 'œ', '•', '-', '-', '~', 'š', 'Š', 'Ÿ', 'ÿ', 'ε',
        '€', 'α', 'Α', 'τ', 'Τ', 'θ', 'Θ');

    return str_replace($replace_from, $replace_to, $string);
}

function dbSlug($slugIt) {
    $slugIt = html_entity_decode($slugIt);

    $slugIt = replaceArray($slugIt);
    $slugIt = replaceQuote($slugIt);
    $slugIt = special_replace($slugIt);

    // $slugIt = iconv('ISO-8859-1', 'UTF-8//TRANSLIT//IGNORE', $slugIt);
    $slugIt = replace_accents($slugIt);
    $slugIt = trim($slugIt);

    return $slugIt;
}

This may seem inefficient, since the same symbol sometimes appears in several replacement functions, but I use the functions in different places in different ways, so some overlap is intentional.

Now the problem is that every time I go and look at the data, I find ANOTHER special character that is not caught by my maze of find-and-replace/remove rules.

The current offending character is one you would think is pretty harmless: a space (" "), which ends up in the database as "Â ". Not all spaces, mind you; it seems to affect only some of the whitespace (I haven't figured out why yet).
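One possible explanation for the stray "Â", offered as an assumption rather than a diagnosis: some of the "spaces" may be no-break spaces (U+00A0), which are two bytes in UTF-8, and something along the way may be reading those bytes as ISO-8859-1. The sketch below reproduces that effect with iconv; `normalizeSpaces` is a hypothetical helper name, not part of the code above.

```php
<?php
// A no-break space (U+00A0) is the two bytes 0xC2 0xA0 in UTF-8.
$nbsp = "\xC2\xA0";

// Reading those bytes as ISO-8859-1 and re-encoding them as UTF-8
// turns 0xC2 into "Â" and leaves a no-break space after it.
$misread = iconv('ISO-8859-1', 'UTF-8', $nbsp);

// Hypothetical helper: map UTF-8 no-break spaces to ordinary spaces.
function normalizeSpaces(string $text): string
{
    return str_replace("\xC2\xA0", ' ', $text);
}

echo bin2hex($misread), "\n";                          // hex bytes of "Â" plus U+00A0
echo normalizeSpaces("caf\xC3\xA9{$nbsp}bar"), "\n";   // no-break space becomes a plain space
```

If this is what is happening, only the spaces that arrived as no-break spaces (for example from a decoded &nbsp;) would be affected, which would match the "only some of the whitespace" symptom.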

I've been doing this for over a week now and every time I go back and take a look I have more to add to the fix.

I am not asking how to remove the "Â". I am hoping for a way to preserve the integrity of the content/data without the special characters that sometimes get mangled when data is moved around, while keeping the data searchable.

I would use something like

preg_replace("/[^a-zA-Z0-9,\-'!&. etc]/", "", $data);

but I worry that I will start mangling words when the special characters I left out of the list get stripped. I have already run into this when "México" came out as "Mxico", so that approach just doesn't work.

The character encoding should be UTF-8, although I have tried changing the header to ISO-8859-1 before encoding, and also not setting any encoding at all, and I always get the same result.

I am sure this is probably the worst way to go about it, but I have not been able to find an efficient solution. Any suggestions? What worries me is that it never seems to end: I keep finding new characters to run through my translation maze.

php mysql character-encoding

pedalpete 23 Aug '09 at 18:07

2 answers

Alix Axel · Answer 1 · 2009-08-23T19:02:17+0000

Save your PHP files as UTF-8.
After connecting, execute SET NAMES 'UTF8';
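A minimal sketch of the connection step, assuming PDO with the MySQL driver. The helper name `utf8Dsn`, the host, the database name and the credentials are all placeholders. The `charset` DSN option is honored by PDO MySQL from PHP 5.3.6 on; on older versions the explicit SET NAMES statement is what does the work.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that asks for UTF-8 on the connection.
function utf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $dbName);
}

// Usage (placeholder credentials, commented out because it needs a live server):
// $pdo = new PDO(utf8Dsn('localhost', 'mydb'), 'user', 'secret');
// $pdo->exec("SET NAMES 'utf8'"); // the statement suggested in this answer
```

On MySQL 5.5.3 and later, `utf8mb4` may be a better choice than `utf8` if the data can contain characters outside the Basic Multilingual Plane.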

If you still need to replace characters, you can use this:

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

EDIT

$string = html_entity_decode(preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8')), ENT_COMPAT, 'UTF-8');
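For illustration, the expression from the EDIT can be wrapped in a function. This is only a sketch: `stripAccents` is a made-up name, and it assumes the input string is valid UTF-8.

```php
<?php
// Hypothetical wrapper around the expression above; assumes UTF-8 input.
function stripAccents(string $string): string
{
    // 1. Encode special characters as named entities, e.g. "é" becomes "&eacute;".
    $encoded = htmlentities($string, ENT_COMPAT, 'UTF-8');

    // 2. Keep only the base letter(s) of accent-style entities, e.g. "&eacute;" becomes "e".
    $stripped = preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $encoded
    );

    // 3. Decode whatever entities remain, so other characters are restored as UTF-8.
    return html_entity_decode($stripped, ENT_COMPAT, 'UTF-8');
}

echo stripAccents('México'), "\n";
echo stripAccents('naïve café'), "\n";
```

Characters whose entity name does not match the pattern are encoded in step 1 and restored unchanged in step 3, so only accent-style letters are simplified.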

Modesto · Answer 2 · 2013-04-03T17:56:44+0000

You can use html_entity_decode($string, ENT_QUOTES, 'UTF-8').

I was having problems with Spanish special characters, and this solved it for me.
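A short sketch of that call, assuming the page and the database both use UTF-8. The function name `decodeEntitiesUtf8` and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: decode HTML entities (including quotes) into UTF-8 text.
function decodeEntitiesUtf8(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

$decoded = decodeEntitiesUtf8('M&eacute;xico &ldquo;quoted&rdquo; it&#039;s');
echo $decoded, "\n";

// With the UTF-8 charset, &nbsp; decodes to the two-byte sequence C2 A0.
echo bin2hex(decodeEntitiesUtf8('&nbsp;')), "\n";
```

Passing the charset explicitly matters on older PHP versions, where html_entity_decode defaulted to ISO-8859-1 before PHP 5.4.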
