Correct way to use Cyrillic in the Python lxml library

I am trying to generate XML files that contain Cyrillic characters, but the output is not what I expect. What is the easiest way to fix this? Example:

from lxml import etree

root = etree.Element('пример')
print(etree.tostring(root))

      

I get:

b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'

      

Instead of:

b'<пример/>'

      



1 answer


Without additional arguments, etree.tostring() outputs ASCII-only data as a bytes object, so non-ASCII characters are written as numeric character references. You can use etree.tounicode() instead:

>>> from lxml import etree
>>> root = etree.Element('пример')
>>> print(etree.tostring(root))
b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'
>>> print(etree.tounicode(root))
<пример/>

      

Or specify a codec with the encoding argument; you will still get a bytes object, so the result needs to be decoded afterwards:

>>> print(etree.tostring(root, encoding='utf8'))
b'<\xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80/>'
>>> print(etree.tostring(root, encoding='utf8').decode('utf8'))
<пример/>

      

Setting encoding to 'unicode' gives you the same output that tounicode() produces, and is the preferred spelling:

>>> print(etree.tostring(root, encoding='unicode'))
<пример/>
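
For comparison, here is a rough sketch that puts the serialization options side by side in one script. It assumes lxml is installed; the variable names are only illustrative. The final call, which adds an XML declaration for data that will be written to a file, goes slightly beyond what is shown above and is included as one possible approach rather than a recommendation.

```python
from lxml import etree

root = etree.Element('пример')

# Default: ASCII-only bytes; non-ASCII characters become numeric references.
ascii_bytes = etree.tostring(root)

# Explicit codec: UTF-8 encoded bytes, which must be decoded to get a str.
utf8_bytes = etree.tostring(root, encoding='utf-8')

# encoding='unicode' returns a str directly, with no decoding step.
text = etree.tostring(root, encoding='unicode')

# Assumption: for a file on disk, UTF-8 bytes plus an XML declaration
# are usually what is wanted.
document = etree.tostring(root, encoding='utf-8', xml_declaration=True)

print(ascii_bytes)
print(utf8_bytes.decode('utf-8'))
print(text)
```

If the goal is a file rather than a string, the bytes in document can be written with a file opened in binary mode ('wb'), so no second encoding step is involved.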

      
