Handling nested elements with Python lxml

Question

Given the simple XML data below:

<book>
   <title>My First Book</title>
   <abstract>
         <para>First paragraph of the abstract</para>
         <para>Second paragraph of the abstract</para>
    </abstract>
    <keywordSet>
         <keyword>First keyword</keyword>
         <keyword>Second keyword</keyword>
         <keyword>Third keyword</keyword>
    </keywordSet>
</book>

How can I traverse the tree using lxml and get all the paragraphs in the "abstract" element as well as all the keywords in the "keywordSet" element?

Below is the code snippet that only returns the first line of text in each element:

from lxml import objectify
root = objectify.fromstring(xml_string) # xml_string contains the XML data above
print(root.title) # returns the book title
for line in root.abstract:
    print(line.para) # returns only the first paragraph
for word in root.keywordSet:
    print(word.keyword) # returns only the first keyword in the set

I tried following this example, but the code above does not work as expected.

Alternatively, it would be even better to read the entire XML tree into a Python dictionary, with each element name as a key and the text of the matching element(s) as the value. I found out that something like this is possible using lxml objectify, but I couldn't figure out how to achieve it.

One very big problem I found when trying to write XML parsing in Python is that most of the "examples" provided are simply too simple and completely fictional to be of much help - otherwise they are just the opposite, using overly complex auto-generated XML data!

Can someone give me a hint?

Thanks in advance!

EDIT: After posting this question, I found a simple solution here.

So my updated code will look like this:

from lxml import objectify
root = objectify.fromstring(xml_string) # xml_string contains the XML data above
print(root.title) # returns the book title
for para in root.abstract.iterchildren():
    print(para) # now returns the text of all paragraphs
for keyword in root.keywordSet.iterchildren():
    print(keyword) # now returns all keywords in the set

python xml lxml

maurobio 14 oct. 14 at 20:52

1 answer

Lukas Graf · Accepted Answer · 2014-10-14T21:01:11+0000

It's pretty simple using XPath:

from lxml import etree

tree = etree.parse('data.xml')

# An absolute path has to start at the root element, <book>
paragraphs = tree.xpath('/book/abstract/para/text()')
keywords = tree.xpath('/book/keywordSet/keyword/text()')

print(paragraphs)
print(keywords)

Output:

['First paragraph of the abstract', 'Second paragraph of the abstract']
['First keyword', 'Second keyword', 'Third keyword']

For more details on XPath syntax, see the XPath tutorial at W3Schools.

Specifically, the expressions above use:

The / selector, to select the root node and immediate children.
The text() operator, to select the text node ("text content") of the matching elements.
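
To see those two pieces in isolation, here is a minimal sketch that parses the XML from the question out of a string instead of data.xml. The variable names (xml_string, para_elements, missing) are only for illustration. Note that a path starting with / is resolved from the document root, so it has to begin with book; a path that skips the root element matches nothing.

```python
from lxml import etree

# The sample document from the question, inlined so the sketch is self-contained.
xml_string = """<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>"""

root = etree.fromstring(xml_string)

# Each / step selects immediate children of the previous step;
# text() then selects the text node of every matching element.
paragraphs = root.xpath('/book/abstract/para/text()')
keywords = root.xpath('/book/keywordSet/keyword/text()')

# Without text(), the same path returns the elements themselves.
para_elements = root.xpath('/book/abstract/para')

# <abstract> is not the root element, so this absolute path matches nothing.
missing = root.xpath('/abstract/para/text()')

print(paragraphs)
print(keywords)
print(missing)
```
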

Here's how it can be done using the Objectify API:

from lxml import objectify

root = objectify.fromstring(xml_string)

paras = [p.text for p in root.abstract.para]
keywords = [k.text for k in root.keywordSet.keyword]

print(paras)
print(keywords)

It seems root.abstract.para is actually shorthand for root.abstract.para[0]. Therefore you need to explicitly use element.iterchildren() to access all child elements.

This is not the case; we obviously both misunderstood the Objectify API. To iterate over the para elements in abstract, you need to iterate over root.abstract.para, not root.abstract. This is weird because you intuitively think of abstract as a collection or container for its nodes, and expect that container to be represented by a Python iterable. But it is actually the selector .para that represents a sequence.
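
A small sketch may make that behaviour easier to see. It assumes the XML from the question is inlined as xml_string; the names abstract_count, para_count and book are only for illustration. It also shows one possible way to build the dictionary the question asks about, under the assumption that you already know which child elements repeat.

```python
from lxml import objectify

# The sample document from the question, inlined so the sketch is self-contained.
xml_string = """<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>"""

root = objectify.fromstring(xml_string)

# len() and iteration work on same-named siblings, not on children:
abstract_count = len(root.abstract)    # there is a single <abstract>
para_count = len(root.abstract.para)   # but two <para> siblings inside it

# Indexing follows the same sequence.
second_para = root.abstract.para[1].text

# Hypothetical dictionary layout: element name -> text, or list of texts.
book = {
    "title": root.title.text,
    "abstract": [p.text for p in root.abstract.para],
    "keywordSet": [k.text for k in root.keywordSet.keyword],
}

print(abstract_count)
print(para_count)
print(book)
```

This is why the loop over root.abstract in the question runs only once: it visits the single abstract element, and line.para then resolves to the first para inside it.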
