Correct way to use Cyrillic in the Python lxml library

Question


I am trying to generate XML files that contain Cyrillic characters, but the serialized output escapes them as numeric character references. What's the easiest way to avoid this? Example:

from lxml import etree

root = etree.Element('пример')

print(etree.tostring(root))

I get:

b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'

Instead of:

b'<пример/>'


python xml lxml cyrillic

Aleksey Papushin Apr 20 15 at 14:25


1 answer

Martijn Pieters · Accepted Answer · 2015-04-20T14:32:44+0000

etree.tostring(), without additional arguments, outputs ASCII-only data as a bytes object. You can use etree.tounicode():

>>> from lxml import etree
>>> root = etree.Element('пример')
>>> print(etree.tostring(root))
b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'
>>> print(etree.tounicode(root))
<пример/>

or specify a codec with the encoding argument; you will still get bytes, so the result will need to be decoded again:

>>> print(etree.tostring(root, encoding='utf8'))
b'<\xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80/>'
>>> print(etree.tostring(root, encoding='utf8').decode('utf8'))
<пример/>

Setting the encoding to unicode gives you the same output that tounicode() produces, and is the preferred spelling:

>>> print(etree.tostring(root, encoding='unicode'))
<пример/>
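Since the question is about generating XML files, here is a minimal sketch of writing the tree to disk as UTF-8 bytes with an XML declaration, then parsing it back to check the round trip. The helper name write_xml, the child element 'элемент', its text, and the temporary file path are illustrative assumptions, not part of the original question.

```python
import os
import tempfile

from lxml import etree


def write_xml(root, path):
    """Serialize `root` to `path` as UTF-8 bytes with an XML declaration.

    `write_xml` is a hypothetical helper name used only for this sketch.
    """
    data = etree.tostring(root, encoding='utf-8', xml_declaration=True)
    # tostring() returned bytes here, so the file is opened in binary mode.
    with open(path, 'wb') as fh:
        fh.write(data)


root = etree.Element('пример')
# Illustrative child element and text, not taken from the question.
etree.SubElement(root, 'элемент').text = 'значение'

path = os.path.join(tempfile.mkdtemp(), 'example.xml')
write_xml(root, path)

# Read the raw bytes back and parse them to confirm the names survived.
with open(path, 'rb') as fh:
    raw = fh.read()

parsed = etree.fromstring(raw)
print(raw.decode('utf-8'))
```

Writing bytes with an explicit encoding keeps the declaration in the file consistent with the actual byte encoding; encoding='unicode' is better suited to cases where you want a str to print or embed elsewhere.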
