How do I find out where the CDATA is within an lxml element?

I need to parse and rebuild a file format used by a parser that speaks a language that can only charitably be described as XML. I understand that standards-compliant XML doesn't care about CDATA or whitespace, but unfortunately this application requires me to care about both...

I am using lxml.etree because it is pretty good at preserving CDATA.

For example:

s = '''
<root>
  <item>
     <![CDATA[whatever]]>
  </item>
</root>'''

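import lxml.etree as et
# strip_cdata=False keeps CDATA sections instead of converting them to plain text
root = et.fromstring(s, et.XMLParser(strip_cdata=False))
item = root.find('item')
print(et.tostring(item, encoding='unicode'))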

      

Prints:

<item>
     <![CDATA[whatever]]>
  </item>

      

lxml exactly preserved the formatting of the <item> tag... great!

The problem is that I can't tell exactly where the CDATA starts and ends within the text of the tag. The item.text property gives no indication of which part of the text is wrapped in CDATA:

item.text
 ==> '\n     whatever\n  '

      

So if I modify it and try to write it back as CDATA, I lose the surrounding whitespace:

item.text = et.CDATA('foobar')
et.tostring(item)
 ==> '<item><![CDATA[foobar]]></item>\n'

      

lxml clearly "knows" where the CDATA is within the text node, because it preserves it in the output of et.tostring(). However, I cannot find a way to figure out which parts of the text are CDATA and which are not. Any advice?


1 answer


I'm not sure about lxml, but with minidom you can change the CDATA section and keep the surrounding whitespace, because CDATASection is a separate node type.

>>> from xml.dom import minidom
>>> data = minidom.parseString(s)
>>> parts = data.getElementsByTagName('item')
>>> item = parts[0]
>>> item.childNodes
[<DOM Text node "'\n     '">, <DOM CDATASection node "'whatever'">, <DOM Text node "'\n  '">]
>>> item.childNodes[1].nodeValue = 'changed'
>>> print(item.toxml())
<item>
     <![CDATA[changed]]>
  </item>

      



For details, see "xml.dom.minidom: Getting CDATA values".
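
Since the question is really about telling which parts of the text are CDATA and which are not, here is a minimal sketch of how that could look with minidom, assuming Python 3. The helper name text_segments and the SAMPLE string are made up for illustration; only the nodeType constants and the data attribute come from the standard library.

```python
from xml.dom import Node, minidom

# Same document shape as in the question (hypothetical sample data).
SAMPLE = """<root>
  <item>
     <![CDATA[whatever]]>
  </item>
</root>"""


def text_segments(element):
    """Return (is_cdata, text) pairs for the direct text children of element."""
    segments = []
    for child in element.childNodes:
        if child.nodeType == Node.CDATA_SECTION_NODE:
            # A CDATA section is its own node, so its boundaries are exact.
            segments.append((True, child.data))
        elif child.nodeType == Node.TEXT_NODE:
            # Ordinary character data, e.g. the whitespace around the CDATA.
            segments.append((False, child.data))
    return segments


doc = minidom.parseString(SAMPLE)
item = doc.getElementsByTagName("item")[0]
segments = text_segments(item)
print(segments)
# [(False, '\n     '), (True, 'whatever'), (False, '\n  ')]
```

Each pair says whether that stretch of text was wrapped in CDATA, so you can edit only the CDATA node's data and leave the whitespace nodes alone before calling toxml().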
