How to convert &lt; to < in lxml, Python?

I have an XML file:

<body>
    <entry>
         I go to <hw>to</hw> to school.
    </entry>
</body>

For some reason I changed <hw> to &lt;hw&gt; and </hw> to &lt;/hw&gt; before parsing it with the lxml parser.

<body>
    <entry>
         I go to &lt;hw&gt;to&lt;/hw&gt; to school.
    </entry>
</body>

But after modifying the parsed XML data, I want to get a real <hw> element back, not the escaped text &lt;hw&gt;. How can I do this?

2 answers


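First import the unescape function and get the entry element (e is assumed here to be lxml.etree):

from xml.sax.saxutils import unescape
from lxml import etree as e

entry = body[0]

Then unescape the serialized entry and replace the original with the re-parsed result:

body.replace(entry, e.fromstring(unescape(e.tounicode(entry))))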


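If you know which element contains the incorrectly escaped markup:

from lxml import etree

# parse whole document as usual..
# find the entry element..
# parse the fragment; entry.text is mixed content (text plus <hw>),
# so wrap it in a temporary root element to make it well-formed
fragment = etree.fromstring("<wrapper>" + entry.text + "</wrapper>")
# (optionally) move the parsed content into the tree
entry.text = fragment.text
entry.extend(list(fragment))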
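As a rough, runnable illustration of this second approach: the parser has already unescaped entry.text, so the text can be re-parsed directly. The helper name expand_escaped_markup and the temporary wrapper element are hypothetical, and the sketch assumes the element has no child elements of its own before the call.

```python
from lxml import etree


def expand_escaped_markup(element):
    """Re-parse element.text as XML and splice the result into element.

    Hypothetical helper; assumes element has no existing children.
    """
    if not element.text:
        return
    # The text is mixed content, so it needs a temporary root to be well-formed.
    wrapper = etree.fromstring("<wrapper>" + element.text + "</wrapper>")
    element.text = wrapper.text
    # Copy the child list first: append() moves each child out of the wrapper.
    for child in list(wrapper):
        element.append(child)


source = """<body>
    <entry>
         I go to &lt;hw&gt;to&lt;/hw&gt; to school.
    </entry>
</body>"""

body = etree.fromstring(source)
entry = body.find("entry")
expand_escaped_markup(entry)
print(etree.tostring(body, encoding="unicode"))
```

Because the original entry element stays in the tree, any attributes on it are preserved; only its text and children change.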
