How do I load XML when PHP cannot specify the correct encoding?

I am trying to load an XML source from a remote location, so I cannot control the formatting. Unfortunately, the XML file I'm trying to download doesn't declare an encoding:

<ROOT xmlns:sql="urn:schemas-microsoft-com:xml-sql"> <NODE> </NODE> </ROOT>

      

When trying something like:

$doc = new DOMDocument( );
$doc->load(URI);

      

I get:

Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x38 0x2C 0x38

      

I've looked at ways to suppress this but had no luck. How do I load this so I can use it with DOMDocument?

+2




4 answers


You can preprocess the document by prepending an XML declaration that states the encoding the document is actually delivered in. You need to find out what that encoding is, of course. The DOM parser should then be able to parse it.

Example XML declaration:



<?xml version="1.0" encoding="UTF-8" ?>
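As a minimal sketch of that preprocessing step: the helper name withXmlDeclaration() is hypothetical, the inline sample string stands in for the downloaded document, and ISO-8859-1 is only an assumption (the 0xA3 byte in the error message would be a pound sign in that encoding). Confirm the real encoding with the provider before relying on it.

```php
<?php
// Hypothetical helper: prepend an XML declaration when the payload has none.
function withXmlDeclaration(string $xml, string $encoding): string
{
    $xml = ltrim($xml);
    if (strncmp($xml, '<?xml', 5) === 0) {
        return $xml; // a declaration is already present; leave it untouched
    }
    return '<?xml version="1.0" encoding="' . $encoding . '"?>' . "\n" . $xml;
}

// Stand-in for file_get_contents($url); "\xA3" is the pound sign in ISO-8859-1.
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$doc = new DOMDocument();
$doc->loadXML(withXmlDeclaration($raw, 'ISO-8859-1')); // assumed encoding

// DOM hands text back as UTF-8, whatever the input encoding was.
$text = $doc->getElementsByTagName('NODE')->item(0)->textContent;
```

The parser does the conversion itself here, so the original bytes are never rewritten.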

      

+1




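You have to convert your document to UTF-8. The simplest way is to use utf8_encode(), which assumes the input is ISO-8859-1. Note that utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1') is the equivalent call.

DOMDocument example: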

$doc = new DOMDocument();
$content = utf8_encode(file_get_contents($url));
$doc->loadXML($content);

      

SimpleXML example:

$xmlInput = simplexml_load_string(utf8_encode(file_get_contents($url_or_file)));

      


If you don't know the current encoding, use mb_detect_encoding(), for example:



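// Detect on the raw bytes; converting first would hide the original encoding
$content = file_get_contents($url_or_file);
// The candidate list is an assumption, and detection is only a best guess
$encoding = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true);
$doc = new DOMDocument();
// loadXML needs a complete declaration: version attribute and closing "?>"
$res = $doc->loadXML("<?xml version=\"1.0\" encoding=\"$encoding\"?>" . $content);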

      

Notes:

  • If the encoding cannot be detected (the function returns FALSE), you can try to force the conversion with utf8_encode().
  • If you are loading HTML via $doc->loadHTML, you can still use the XML header.

If you know the encoding, use iconv() to convert it:

$xml = iconv('ISO-8859-1', 'UTF-8', $xmlInput);
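One pitfall with unconditional conversion is double-encoding input that is already valid UTF-8. The sketch below guards against that. The helper name toUtf8() and the ISO-8859-1 default are assumptions for illustration, and it needs the mbstring and iconv extensions.

```php
<?php
// Hypothetical helper: return $xml as UTF-8, converting from $assumed only when needed.
function toUtf8(string $xml, string $assumed = 'ISO-8859-1'): string
{
    if (mb_check_encoding($xml, 'UTF-8')) {
        return $xml; // already valid UTF-8; converting again would double-encode
    }
    $converted = iconv($assumed, 'UTF-8', $xml);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $assumed to UTF-8");
    }
    return $converted;
}

// Stand-in for file_get_contents($url); "\xA3" is the pound sign in ISO-8859-1.
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$doc = new DOMDocument();
$doc->loadXML(toUtf8($raw));
$node = $doc->getElementsByTagName('NODE')->item(0)->textContent;
```

The validity check is a heuristic: a byte sequence can be valid UTF-8 by coincidence, so a known encoding from the provider is still the better source of truth.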

      

+1




You can try using XMLReader instead. XMLReader is designed specifically for XML, and its open() and XML() methods take an encoding argument (including null for none).
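A rough sketch of that idea, using XMLReader::XML() on an inline sample string rather than a remote URI. The ISO-8859-1 value is an assumption about the feed, and how the encoding argument behaves for a document with no declaration is worth verifying against your own PHP and libxml versions.

```php
<?php
// Stand-in for the remote document; "\xA3" is the pound sign in ISO-8859-1.
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$reader = new XMLReader();
// The second argument names the encoding the parser should assume (an assumption here).
$reader->XML($raw, 'ISO-8859-1');

$value = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'NODE') {
        $value = $reader->readString(); // text content of the current element
        break;
    }
}
$reader->close();
```

For a remote file, $reader->open($uri, 'ISO-8859-1') takes the same encoding argument.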

0




I faced a similar situation. I was getting an XML file that was supposed to be UTF-8 encoded, but it included some bad ISO characters.

I wrote the following code to encode the bad characters to UTF-8:

<?php

# The XML file with bad characters
$filename = "sample_xml_file.xml";

# Read file contents to a variable
$contents = file_get_contents($filename);

# Match either a run of well-formed UTF-8 multibyte sequences (group 1)
# or a single stray high byte (group 2), scanning the contents only once
$pattern = '/((?:[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3})+)|([\x80-\xFF])/';

# Leave the UTF-8 sequences alone; treat stray bytes as ISO-8859-1
# and replace them with their UTF8 equivalents
# (utf8_encode() is deprecated as of PHP 8.2)
$contents = preg_replace_callback($pattern, function ($m) {
        return isset($m[2]) ? utf8_encode($m[2]) : $m[0];
}, $contents);

# Write the fixed contents back to the file
file_put_contents($filename, $contents);

# Cleanup
unset($contents);

# Now the bad characters have been encoded to UTF8
# It will now load file with DOMDocument
$dom = new DOMDocument();
$dom->load($filename);

?>

      

I posted about the solution in more detail: http://dev.strategystar.net/2012/01/convert-bad-characters-to-utf-8-in-an-xml-file-with-php/

-1








