Extract all text from XML data using python

Question

I am new to processing XML data. I want to extract the text data from the following XML file:

<data>
    <p>12345<strong>45667</strong>abcde</p>
</data>

The expected output is: ['12345', '45667', 'abcde']

Currently I have tried:

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]

But the result shows only ['12345', '45667']; 'abcde' is missing. Can anyone help me? Thanks in advance!

+3

python xml xml-parsing

Xueqing Liu 05 Jan. at 18:57

2 answers

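getiterator() (or its replacement, iter()) iterates over the child elements, and reading .text from each one returns only the text that comes before that element's first child. abcde is not the text of any element: it is a text node stored as the tail of the strong element.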
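To make the text/tail distinction concrete, here is a minimal sketch that prints both attributes for every element. It assumes the XML from the question is inlined as a string (the variable name xml_data is just for illustration) rather than read from a file.

```python
import xml.etree.ElementTree as ET

# The question's XML, inlined without whitespace for clarity (an assumption;
# the original is read from a file).
xml_data = "<data><p>12345<strong>45667</strong>abcde</p></data>"
root = ET.fromstring(xml_data)

# For each element, show the text before its first child (.text)
# and the text that follows its closing tag (.tail).
for elem in root.iter():
    print(elem.tag, repr(elem.text), repr(elem.tail))

p = root.find('p')
strong = p.find('strong')
```

Here 'abcde' shows up only as the tail of strong, which is why a list built from .text alone misses it.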

You can use the itertext() method:

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
# itertext() yields the text inside p, including nested text and tails, in document order
print(list(tree.find('p').itertext()))

This prints:

['12345', '45667', 'abcde']
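If you call itertext() on the root instead of on p, the indentation between tags also comes back as whitespace-only strings. A possible way to handle that, sketched here with the XML inlined as a string (the names xml_data and texts are illustrative), is to strip and filter the results:

```python
import xml.etree.ElementTree as ET

# The question's XML, inlined with its original indentation (an assumption;
# the original is read from a file).
xml_data = """
<data>
    <p>12345<strong>45667</strong>abcde</p>
</data>
"""
root = ET.fromstring(xml_data)

# Walk every text and tail under the root, dropping whitespace-only pieces
# that come from the indentation between tags.
texts = [t.strip() for t in root.itertext() if t.strip()]
print(texts)
```

Note that strip() also trims leading and trailing whitespace inside real text, which may or may not be what you want.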

+1

alecxe 05 Jan. 15 at 19:06

Gilles quenot · Accepted Answer · 2015-01-05T19:02:55+0000

Try it with XPath and lxml:

import lxml.etree as etree

string = '''
<data>
    <p>12345<strong>45667</strong>abcde</p>
</data>
'''

tree = etree.fromstring(string)

print(tree.xpath('//p//text()'))

The XPath expression means: "select every text node that is a descendant, at any depth, of any p element".

OUTPUT:

['12345', '45667', 'abcde']
