How do I load XML when PHP cannot specify the correct encoding?
I am trying to load an XML source from a remote location, so I cannot control the formatting. Unfortunately, the XML file I'm trying to download doesn't declare an encoding:
<ROOT xmlns:sql="urn:schemas-microsoft-com:xml-sql"> <NODE> </NODE> </ROOT>
When trying something like:
$doc = new DOMDocument( );
$doc->load(URI);
I get:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x38 0x2C 0x38
I've looked at ways to suppress this but no luck. How do I load this so I can use it with the DOMDocument?
You have to convert the document to UTF-8. The simplest way is utf8_encode(), which assumes the input is ISO-8859-1 (note that it is deprecated as of PHP 8.2).
DOMDocument example:
$doc = new DOMDocument();
$content = utf8_encode(file_get_contents($url));
$doc->loadXML($content);
SimpleXML example:
$xmlInput = simplexml_load_string(utf8_encode(file_get_contents($url_or_file)));
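If you don't know the current encoding, run mb_detect_encoding() on the raw bytes (before any conversion) and declare the result in an XML header, for example:
$content = file_get_contents($url_or_file);
// The candidate list is a guess; ISO-8859-1 accepts any byte sequence, so keep it last
$encoding = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true);
$doc = new DOMDocument();
$res = $doc->loadXML("<?xml version='1.0' encoding='$encoding'?>" . $content);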
Notes:
- If the encoding cannot be found (the function returns false), you can try to force the encoding through utf8_encode().
- If you are loading HTML via $doc->loadHTML(), you can still use the XML header.
If you know the encoding, use iconv() to convert it:
$xml = iconv('ISO-8859-1', 'UTF-8', $xmlInput);
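Since utf8_encode() is deprecated in newer PHP versions, the same conversion can be sketched with the mbstring functions. This is only a minimal sketch: the helper name loadXmlAsUtf8() and the ISO-8859-1 fallback are assumptions for illustration, not anything the remote source guarantees.

```php
<?php
// Hypothetical helper: the name and the ISO-8859-1 fallback are assumptions.
function loadXmlAsUtf8(string $raw, string $fallback = 'ISO-8859-1'): DOMDocument
{
    // Convert only when the bytes are not already valid UTF-8.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        $raw = mb_convert_encoding($raw, 'UTF-8', $fallback);
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($raw)) {
        throw new RuntimeException('XML could not be parsed');
    }
    return $doc;
}

// "\xA3" is the pound sign in ISO-8859-1, the first byte reported in the error.
$doc = loadXmlAsUtf8("<ROOT><NODE>\xA38,8</NODE></ROOT>");
echo $doc->getElementsByTagName('NODE')->item(0)->textContent, "\n";
```

With a remote source you would pass file_get_contents($url) as the first argument. If the fallback encoding is wrong, the document still loads but the characters come out garbled, so it is worth checking a known value.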
You can try using XMLReader instead. XMLReader is designed specifically for XML, and its open() and XML() methods take an encoding argument (null for none).
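A rough sketch of that approach, assuming the source really is ISO-8859-1 (the sample string below is made up to match the bytes in the error message):

```php
<?php
// Assumed sample input: ISO-8859-1 bytes with no XML declaration.
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$reader = new XMLReader();
// The second argument tells the parser which encoding to assume for the input.
$reader->XML($raw, 'ISO-8859-1');

// Collect the text nodes; libxml hands values back as UTF-8.
$values = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $values[] = $reader->value;
    }
}
$reader->close();

print_r($values);
```

For a remote file, $reader->open($uri, 'ISO-8859-1') takes the same encoding argument. XMLReader is a streaming parser, so this suits cases where you only need to walk the document rather than build a full DOMDocument.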
I faced a similar situation. I was getting an XML file that was supposed to be UTF-8 encoded, but it included some bad ISO characters.
I wrote the following code to re-encode the bad characters as UTF-8:
<?php
# The XML file with bad characters
$filename = "sample_xml_file.xml";
# Read file contents to a variable
$contents = file_get_contents($filename);
# Re-encode every run of non-ASCII bytes in a single pass
$contents = preg_replace_callback('/[\x80-\xFF]+/', function ($match) {
    # Leave runs that are already valid UTF-8 untouched
    if (mb_check_encoding($match[0], 'UTF-8')) {
        return $match[0];
    }
    # Otherwise treat the bytes as ISO-8859-1, which is what utf8_encode() assumes
    return mb_convert_encoding($match[0], 'UTF-8', 'ISO-8859-1');
}, $contents);
# Write the fixed contents back to the file
file_put_contents($filename, $contents);
# Cleanup
unset($contents);
# Now the bad characters have been encoded to UTF8
# It will now load file with DOMDocument
$dom = new DOMDocument();
$dom->load($filename);
?>
I posted about the solution in more detail: http://dev.strategystar.net/2012/01/convert-bad-characters-to-utf-8-in-an-xml-file-with-php/