Encoding odd HTML entities '&lstroke;'

I'm having problems with some odd HTML entities that are coming from an XML file that I have to parse in PHP 5.6.

Some of the HTML entities:

&lstroke;
n&acute;
a&hook;
e&hook;

      

XML comes from CAB Abstracts ( http://www.cabi.org/publishing-products/online-information-resources/cab-abstracts/ ) and its header is:

<?xml version="1.0" encoding="ISO-8859-1"?>

      

However, I have tried several character encodings without success. I also tried decoding the entities in PHP 5.6 with html_entity_decode and writing the result directly into an HTML file, like this:

$strings = array('&Sacute;wia&hook;tek', 'Kie&lstroke;kiewicz', 'Zagdan&acute;ska', 'Mie&hook;tkiewski');

foreach ($strings as $s) {
    foreach (array(
            'ISO-8859-1', 'ISO-8859-5', 'ISO-8859-15', 'UTF-8',
            'cp866', 'cp1251', 'cp1252', 'KOI8-R', 'BIG5', 'GB2312',
            'BIG5-HKSCS', 'Shift_JIS', 'EUC-JP', 'MacRoman', '') as $l) {
        print $l . ' ==> ';
        print html_entity_decode($s, ENT_COMPAT | ENT_QUOTES | ENT_XML1 | ENT_XHTML | ENT_HTML5, $l) . '<br>';
    }
}

      

Nothing works!!

I would like to avoid any solution that involves scanning the XML file and replacing these entities with the corresponding UTF-8 characters. I can't foresee when odd HTML entities like these will turn up, and the files will be relatively large.

The output should look like this:

Świątek
Kiełkiewicz
Zagdańska 
Miętkiewski

      

So the question is:

How can I decode these odd HTML entities to UTF-8 in PHP?

1 answer


This looks like a home-grown scheme for encoding Polish letters, so there won't be a built-in function that decodes it. The official name of the diacritic on Ą, ą, Ę and ę is ogonek (in both Polish and English). Also, &acute; is the spacing variant of the acute accent; the combining variant is what should be used in this context.



I think your best option is to encode the output as UTF-8 and use strtr() to replace all of these special sequences. You don't need to parse the XML; you can treat it as plain text.
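
The strtr() idea could be sketched roughly like this. The helper name replace_custom_entities and the contents of the map are assumptions for illustration: the map covers only the sequences shown in the question plus a few guessed uppercase siblings, so it would need extending as other sequences appear. The script file itself is assumed to be saved as UTF-8, and input read from the ISO-8859-1 file would need converting to UTF-8 first.

```php
<?php
// Hypothetical helper: name and mapping are illustrative assumptions built
// from the examples in the question, not a complete table.
function replace_custom_entities($text)
{
    static $map = array(
        'a&hook;'   => 'ą',
        'A&hook;'   => 'Ą', // assumed uppercase form
        'e&hook;'   => 'ę',
        'E&hook;'   => 'Ę', // assumed uppercase form
        'n&acute;'  => 'ń',
        'N&acute;'  => 'Ń', // assumed uppercase form
        '&lstroke;' => 'ł',
        '&Lstroke;' => 'Ł', // assumed uppercase form, not seen in the question
    );

    // With an array argument, strtr() tries the longest keys first and does
    // not rescan text it has already replaced.
    return strtr($text, $map);
}

$strings = array('&Sacute;wia&hook;tek', 'Kie&lstroke;kiewicz', 'Zagdan&acute;ska', 'Mie&hook;tkiewski');

$decoded = array();
foreach ($strings as $s) {
    // Custom sequences go first, so 'n&acute;' is not decoded to 'n´'.
    $s = replace_custom_entities($s);
    // Standard named entities such as &Sacute; are left to PHP's HTML5 table.
    $decoded[] = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

print implode("\n", $decoded) . "\n";
```

For the four sample strings this should give Świątek, Kiełkiewicz, Zagdańska and Miętkiewski. Only the strtr() step is safe to run over the raw XML text; html_entity_decode() is applied here to individual extracted strings, because running it over the whole file would also decode &amp; and &lt; and could break the markup.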
