What encoding is this ... and how do you avoid it in PHP?

I'm working on an IMDb data scraper for a site, and everything I scrape seems to come back in a weird encoding that I haven't seen before.

<a href="/keyword/exploding-ship/">Exploding&#xA0;Ship</a>
A Bug&#x27;s Life

      

Is there a PHP function that converts them to regular characters?



2 answers


This is not an encoding. These are HTML numeric character references (entities), written here in hexadecimal.

Try:



$converted = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
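As a quick check, here is a minimal sketch that runs the two sample strings from the question through html_entity_decode(). The variable names are just for illustration, and the str_replace() step is an optional extra that assumes you want ordinary spaces in the scraped text.

```php
<?php
// Sample strings taken from the question.
$title   = 'A Bug&#x27;s Life';
$keyword = 'Exploding&#xA0;Ship';

// ENT_QUOTES makes sure quote references such as &#x27; and &#39; are decoded too.
$decodedTitle   = html_entity_decode($title, ENT_QUOTES, 'UTF-8');
$decodedKeyword = html_entity_decode($keyword, ENT_QUOTES, 'UTF-8');

echo $decodedTitle, PHP_EOL;   // A Bug's Life

// &#xA0; decodes to U+00A0 (no-break space), not an ordinary space.
// In UTF-8 that is the two bytes C2 A0; swap it out if plain spaces are wanted.
$plainKeyword = str_replace("\xC2\xA0", ' ', $decodedKeyword);
echo $plainKeyword, PHP_EOL;   // Exploding Ship
```

Note that the decoded keyword still looks like "Exploding Ship" on screen, but a comparison against a string with a normal space fails until the no-break space is replaced.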

      



These are SGML character escape sequences. They can be either decimal (&#39;) or hex (&#xA0;) and refer directly to the Unicode code point.
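To illustrate the decimal/hex point, a small sketch (variable names are hypothetical): decimal 39 and hexadecimal 27 are the same number, so both forms decode to the same character.

```php
<?php
// &#39; (decimal) and &#x27; (hex) both name code point U+0027, the apostrophe.
$decimal = html_entity_decode('&#39;', ENT_QUOTES, 'UTF-8');
$hex     = html_entity_decode('&#x27;', ENT_QUOTES, 'UTF-8');
var_dump($decimal === $hex);   // bool(true)
var_dump(hexdec('27'));        // int(39)

// &#xA0; is code point 160 (U+00A0); as UTF-8 it is the byte pair C2 A0.
$nbsp = html_entity_decode('&#xA0;', ENT_QUOTES, 'UTF-8');
echo bin2hex($nbsp), PHP_EOL;  // c2a0
```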

html_entity_decode() should work in PHP 5, although I can't check at the moment.



The first comment on the html_entity_decode() manual page provides the following code for older PHP versions:

// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
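    // replace numeric entities
    // Note: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0.
    // chr() returns a single byte, so this only suits single-byte encodings
    // such as ISO-8859-1, not UTF-8.
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);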
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}
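If you ever need the same idea on a current PHP version without the /e modifier, something along these lines might work. This is only a sketch: both function names are hypothetical, it handles numeric references only (named entities such as &amp; are left alone), and it assumes UTF-8 output. In practice html_entity_decode() is the simpler choice.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: decode &#NNN; and &#xHHH; using a callback instead of /e.
function decode_numeric_references(string $text): string
{
    return preg_replace_callback(
        '~&#(x[0-9a-f]{1,6}|[0-9]{1,7});~i',
        function (array $m): string {
            $digits = $m[1];
            $cp = ($digits[0] === 'x' || $digits[0] === 'X')
                ? (int) hexdec(substr($digits, 1))
                : (int) $digits;
            // Leave references beyond the Unicode range untouched.
            return $cp > 0x10FFFF ? $m[0] : codepoint_to_utf8($cp);
        },
        $text
    );
}

echo decode_numeric_references('A Bug&#x27;s Life'), PHP_EOL; // A Bug's Life
```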

      
