lxml - Is there any hacky way to keep &quot;?

I noticed that lxml automatically converts the XML entity &quot; into the literal character it stands for:

>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
  <elem>"hello world"</elem>
</root>

>>> 

      

I found one related old thread (2009-02-07):

s = cStringIO.StringIO("" "This is MAN!" "")
e = etree.parse(s, etree.XMLParser(resolve_entities=False))

Note that there is also etree.fromstring().

etree.tostring(e)
"She! MAN!"

     

I would expect resolve_entities=False to prevent the translation of, for example, &quot; to ".

The "resolve_entities" option is meant for entities defined in a DTD, for which you want to keep the reference instead of the resolved value. The entities you mention are part of the XML specification, not of a DTD.
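To make that distinction concrete, here is a minimal sketch, assuming a hypothetical DTD entity named who (it is not part of the thread). With resolve_entities=False the DTD entity should survive as a reference, while the predefined &quot; is still decoded by the parser:

```python
from lxml import etree

# "who" is a hypothetical entity declared in an internal DTD subset.
doc = b'<!DOCTYPE root [<!ENTITY who "world">]><root>&quot;hello&quot; &who;</root>'

parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(doc, parser)
out = etree.tostring(root)

# The DTD-defined entity is kept as a reference; the predefined &quot; is not.
print(out)
```

Serializing the root element rather than the whole tree keeps the DOCTYPE out of the output, which makes the difference easier to see.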

Is there any other way to prevent this behavior (or, if nothing else, to reverse it after the fact)?

What you get is well-formed XML. May I ask why you need the entity references in the output file?

However, the reply only asks why I would want to do this; there is no direct answer to the problem. I am very surprised that the etree parser forces the conversion without giving me an option to disable it.
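A small sketch of why no such switch may exist, using the standard library's xml.etree.ElementTree as a stand-in (the lxml session above shows the same decoded text). Once parsed, the tree does not seem to record which spelling the file used:

```python
import xml.etree.ElementTree as ET

with_entity = ET.fromstring('<elem>&quot;hello world&quot;</elem>')
with_literal = ET.fromstring('<elem>"hello world"</elem>')

# Both spellings produce the same text node, so a serializer working from
# the tree alone cannot tell which one the source file contained.
print(repr(with_entity.text))
same = with_entity.text == with_literal.text
print(same)
```

If that holds, keeping &quot; has to happen at serialization time, not at parse time.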

The following example shows why I need this. The XML is for the XBMC skin parser:

>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
        <defaultcontrol>101</defaultcontrol>
        <controls>
                <control type="button" id="101">
                        <onfocus>Dialog.Close(212)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="102">
                        <visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
                        <onfocus>RunScript(script.test)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="103">
                        <visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
                        <onfocus>Close</onfocus>
                        <onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
                </control>
        </controls>
</window>

>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1])  # remove one element, just as a demonstration
... 
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read()  # the file after the change
<window id="1234">
        <defaultcontrol>101</defaultcontrol>
        <controls>
                <control type="button" id="101">
                        <onfocus>Dialog.Close(212)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="102">
                        <visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
                        <onfocus>RunScript(script.test)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="103">
                        <visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
                        <onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
                </control>
        </controls>
</window>

>>> 

      

As you can see, in the last onfocus element of the control with id "103", the &quot; entities are no longer in their original form. This causes an error when the variable $INFO[VideoPlayer.Album] itself contains quotes: the argument then becomes ""test"", which is invalid.

So is there some hacky way to keep &quot; in its original form?

[UPDATE]: For those interested, the other three predefined XML entities, i.e. &gt;, &lt; and &amp;, are only converted when method="html" is used and the text is inside a <script> tag. lxml and xml.etree.ElementTree, on both Python 2 and Python 3, use the same mechanism, which confuses people:

>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
  <script>"'&amp;&gt;&lt;</script>
  <p>"'&amp;&gt;&lt;</p>
</root>

>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>

>>> 

      

[UPDATE2]: Below is a test that runs through a list of HTML tags:

#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']
from lxml import etree as et
for e in acceptable_elements:
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    s = et.tostring(r, pretty_print=1, method="html")
    closed_tag = "</" + e + ">"
    if closed_tag not in s:
        print s

      

Run this code and you will see output like this:

<area>

<br>

<col>

<hr>

<img>

<input>

      

As you can see, only the opening tags are printed and the rest falls into a black hole. I tested all five XML entities and they all behave the same way, which is confusing. This does not happen when using HTMLParser, so I guess there is a mismatch between fromstring (which parses as XML by default) and tostring with method="html". I also found that it has nothing to do with entities, because "<img>hello</img>" (no entities) is truncated to <img> too. The text hello is not lost from the tree; it reappears whenever method="xml" is used for printing.
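The <img> case can be reduced to a few lines. This is only a sketch of the behavior described above, not an explanation of lxml internals; it compares the two serialization methods on the same parsed element:

```python
from lxml import etree

# Parsed as XML, so the text is stored on the element.
r = etree.fromstring("<img>hello</img>")

as_html = etree.tostring(r, method="html")
as_xml = etree.tostring(r, method="xml")

print(as_html)  # the text is reported to be dropped with method="html"
print(as_xml)   # the text is written out with method="xml"
print(r.text)   # the tree itself still holds the text
```

The six tags printed by the loop (area, br, col, hr, img, input) are all empty ("void") elements in HTML, which would be consistent with the HTML serializer writing only their start tag.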

1 answer


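from xml.sax.saxutils import escape, quoteattr
from lxml import etree

# Extra characters to escape in text, on top of &, < and > handled by escape().
QUOTES = {"'": "&apos;", '"': "&quot;"}

def to_string(xdoc):
    """Serialize the tree by hand, writing quotes in text as entities."""
    parts = []
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            text = escape(elem.text, QUOTES) if elem.text is not None else ""
            # quoteattr() escapes the value and adds the surrounding quotes
            attrs = "".join(' %s=%s' % (k, quoteattr(v)) for k, v in elem.attrib.items())
            parts.append("<%s%s>%s" % (elem.tag, attrs, text))
        elif action == 'end':
            # escape the tail as well; an empty tail becomes a newline
            tail = escape(elem.tail, QUOTES) if elem.tail else "\n"
            parts.append("</%s>%s" % (elem.tag, tail))
    return "".join(parts)

# sample input taken from the question
xml_text = "<root><elem>&quot;hello world&quot;</elem></root>"
xdoc = etree.fromstring(xml_text)
s = to_string(xdoc)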
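The question also asks whether the conversion can be reversed after the fact. The following Python 3 sketch tries that on an already serialized string. The function name requote_text and the script name a.py are made up for illustration, and it assumes the document contains no comments, CDATA sections or processing instructions:

```python
import re
import xml.etree.ElementTree as ET

def requote_text(serialized):
    """Rewrite literal double quotes in text nodes as &quot;.

    Assumes `serialized` has no comments, CDATA sections or processing
    instructions, so every run between '>' and '<' is element text.
    """
    def fix(match):
        return ">" + match.group(1).replace('"', "&quot;") + "<"
    return re.sub(r">([^<]*)<", fix, serialized)

# Hypothetical, shortened version of the control from the question.
root = ET.fromstring(
    '<control id="103"><onfocus>RunScript(&quot;a.py&quot;)</onfocus></control>'
)
plain = ET.tostring(root, encoding="unicode")
fixed = requote_text(plain)
print(fixed)
```

Quotes inside tags are left alone because the pattern only touches characters between a closing '>' and the next '<'. The same function should be applicable to the string returned by lxml's etree.tostring, but that is an assumption worth checking against real skin files.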

      


