Python libxml2 cannot parse Unicode strings

Question

OK, the docs for the libxml2 bindings for Python are indeed ****. My problem:

An XML document is stored in a string variable in Python. The string is a Unicode instance and contains non-ASCII characters. I want to parse it with libxml2, using something like this:

# -*- coding: utf-8 -*-
import libxml2

DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
  <something>Bäääh!</something>
</data>
"""

xml_doc = libxml2.parseDoc(DOC)

with this result:

Traceback (most recent call last):
  File "test.py", line 13, in <module>
    xml_doc = libxml2.parseDoc(DOC)
  File "c:\Python26\lib\site-packages\libxml2.py", line 1237, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 46-48:
ordinal not in range(128)

The point is the u"..." declaration. If I replace it with a plain "..", everything is fine. Unfortunately that doesn't work in my setup, because DOC will definitely be a Unicode instance.

Does anyone have any idea how libxml2 can be used to parse UTF-8 encoded strings?

+2

python xml unicode libxml2

Boldewyn 14 oct. 09 at 21:20

2 answers

It should be

# -*- coding: utf-8 -*-
import libxml2

DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
  <something>Bäääh!</something>
</data>
""".encode("UTF-8")

xml_doc = libxml2.parseDoc(DOC)

The .encode("UTF-8") call is required to get the byte-string representation of the Unicode string, encoded as UTF-8.
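As a minimal sketch of what that call does, the snippet below uses only the standard library and does not import libxml2. The helper name to_utf8_bytes is hypothetical and not part of any library. It assumes Python 3 naming, where the byte type is bytes (on Python 2 that is the same type as str).

```python
# -*- coding: utf-8 -*-
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
  <something>Bäääh!</something>
</data>
"""


def to_utf8_bytes(text):
    """Hypothetical helper: return UTF-8 bytes for a Unicode string.

    Byte strings are passed through unchanged, so the helper is safe to
    call on input that has already been encoded.
    """
    if isinstance(text, bytes):
        return text
    return text.encode("UTF-8")


encoded = to_utf8_bytes(DOC)

# The traceback in the question comes from an implicit ASCII conversion.
# Doing the same conversion explicitly fails in the same way.
try:
    DOC.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The value in encoded is what would be handed to libxml2.parseDoc in place of the Unicode instance.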

+9

Peter Hoffmann 14 oct. 09 at 21:35

Andrey Vlasovskikh · Accepted Answer · 2009-10-14T21:34:46+0000

XML is a binary format, even though it looks like text. The encoding declared at the beginning of the XML file tells the parser how to decode the XML bytes into text.
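To illustrate that point, here is a small sketch that serializes the same document in two encodings and parses the resulting bytes. It uses xml.etree.ElementTree from the standard library as a stand-in for libxml2, so it shows the general principle rather than libxml2's own behaviour. TEMPLATE and parse_as are hypothetical names introduced for the example.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# The same logical document; only the declared encoding varies.
TEMPLATE = (
    u'<?xml version="1.0" encoding="%s"?>\n'
    u'<data><something>B\xe4\xe4\xe4h!</something></data>'
)


def parse_as(encoding):
    """Encode the template with `encoding`, then parse the raw bytes.

    Returns the bytes that were parsed and the decoded element text.
    """
    raw = (TEMPLATE % encoding).encode(encoding)
    text = ET.fromstring(raw).find("something").text
    return raw, text


utf8_bytes, utf8_text = parse_as("UTF-8")
latin1_bytes, latin1_text = parse_as("ISO-8859-1")
```

The two byte strings differ, but the parser recovers the same text from both because each declaration matches the bytes it precedes.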

What you have to do is pass a str, not a unicode, to your library:

xml_doc = libxml2.parseDoc(DOC.encode("UTF-8"))

(Although some tricks are possible with site.setencoding if you are interested in reading or writing unicode strings with automatic conversion through the locale.)

Edit: The Unicode article by Joel Spolsky is a good guide to string characters versus bytes, encodings, etc.
