Python libxml2 cannot parse Unicode strings
OK, the docs for the libxml2 bindings for Python are indeed ****. My problem:
An XML document is stored in a string variable in Python. The string is a unicode instance and contains non-ASCII characters. I want to parse it with libxml2, something like this:
# -*- coding: utf-8 -*-
import libxml2
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>Bäääh!</something>
</data>
"""
xml_doc = libxml2.parseDoc(DOC)
with this result:
Traceback (most recent call last):
File "test.py", line 13, in <module>
xml_doc = libxml2.parseDoc(DOC)
File "c:\Python26\lib\site-packages\libxml2.py", line 1237, in parseDoc
ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 46-48:
ordinal not in range(128)
The point is the u"..." declaration. If I replace it with a plain "..." string, then everything is fine. Unfortunately that doesn't work in my setup, because DOC will definitely be a unicode instance.
Does anyone have any idea how libxml2 can be used to parse UTF-8 encoded strings?
XML is a binary format even though it looks like text. The encoding is specified at the beginning of the XML file to decode the XML bytes into text.
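To illustrate the point about the declaration, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of libxml2 (a swap made only so the example runs without extra packages). build_document is a hypothetical helper, not part of any library. The same text is serialized to two different byte sequences, and the parser recovers identical text from each because it reads the declared encoding:

```python
import xml.etree.ElementTree as ET

# The document body as text (code points), independent of any encoding.
TEXT = u"<data><something>B\u00e4\u00e4\u00e4h!</something></data>"


def build_document(encoding):
    """Hypothetical helper: prepend a declaration and encode to bytes."""
    declaration = u'<?xml version="1.0" encoding="%s"?>\n' % encoding
    return (declaration + TEXT).encode(encoding)


utf8_bytes = build_document("UTF-8")
latin1_bytes = build_document("ISO-8859-1")

# Different bytes on the wire, same text after parsing.
utf8_value = ET.fromstring(utf8_bytes).find("something").text
latin1_value = ET.fromstring(latin1_bytes).find("something").text
```

The byte strings differ, but both parse to the same element text, which is why the parser wants bytes plus a declaration rather than already-decoded text.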
What you have to do is pass str, not unicode, to your library:
xml_doc = libxml2.parseDoc(DOC.encode("UTF-8"))
(Although some tricks are possible with site.setencoding if you are interested in reading or writing unicode strings with automatic conversion through the locale.)
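As a rough sketch of why the original call fails, assuming the binding converts the unicode object with the default ASCII codec (which is what the traceback suggests): the ASCII encode raises, while an explicit UTF-8 encode yields a byte string that can be handed to the parser. This example does not call libxml2 itself, so it runs without the bindings installed:

```python
# The same document as in the question, written with escapes so the
# source file's own encoding does not matter.
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>B\u00e4\u00e4\u00e4h!</something>
</data>
"""

# An implicit conversion with the ASCII codec cannot represent the umlauts.
try:
    DOC.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An explicit UTF-8 encode always succeeds and matches the declaration.
utf8_doc = DOC.encode("UTF-8")
```

Each of the three umlauts takes two bytes in UTF-8, so the encoded document is three bytes longer than the text is in characters.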
Edit: The Unicode article by Joel Spolsky is a good guide to string characters versus bytes, encodings, etc.
It should be
# -*- coding: utf-8 -*-
import libxml2
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>Bäääh!</something>
</data>
""".encode("UTF-8")
xml_doc = libxml2.parseDoc(DOC)
The .encode("UTF-8") call is required to get the binary representation of the unicode string in the UTF-8 encoding.
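If the document can arrive either as text or as already-encoded bytes, a small guard can do the encoding only when needed. This is just a sketch: to_utf8_bytes is a hypothetical helper name, and it assumes the XML declaration says UTF-8 (or is absent), since the bytes must match whatever encoding the declaration names:

```python
# Works on both Python 2 (unicode) and Python 3 (str).
text_type = type(u"")


def to_utf8_bytes(doc):
    """Hypothetical helper: return doc as UTF-8 bytes, encoding text only."""
    if isinstance(doc, text_type):
        return doc.encode("UTF-8")
    # Already a byte string; assume the caller encoded it correctly.
    return doc


encoded = to_utf8_bytes(u"<something>B\u00e4\u00e4\u00e4h!</something>")
unchanged = to_utf8_bytes(b"<something>ok</something>")
```

The result could then be passed to libxml2.parseDoc in place of the raw unicode object.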