Extract all text from XML data using Python
I am new to processing XML data. I want to extract the text from the following XML file:
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
so the expected output is:
['12345','45667', 'abcde']
Currently I have tried:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]
But the result contains only ['12345', '45667']; 'abcde' is missing. Can anyone help me? Thanks in advance!
Try doing it with XPath and lxml:
import lxml.etree as etree
string = '''
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
'''
tree = etree.fromstring(string)
print(tree.xpath('//p//text()'))
The XPath expression means: "select every text node that is a descendant of any p element, at any depth".
OUTPUT:
['12345', '45667', 'abcde']
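getiterator() (or its replacement, iter()) iterates over the elements, and each element's text attribute holds only the text that comes before its first child. 'abcde' is not the text of any element: it is stored as the tail of the strong element, so reading text alone skips it.
You can use the itertext() method instead: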
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
# itertext() yields each element's text and tail in document order
print(list(tree.find('p').itertext()))
This prints:
['12345', '45667', 'abcde']
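To see where 'abcde' actually lives, it may help to print the text and tail of every element. The following is a minimal sketch using only the standard library; it parses the question's document from a string (written without whitespace between tags, which is an assumption made here to keep the output clean), and the names XML, parts and all_text are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, parsed from a string instead of a file.
# Whitespace between tags is left out so no whitespace-only text shows up.
XML = "<data><p>12345<strong>45667</strong>abcde</p></data>"
root = ET.fromstring(XML)

# For each element, .text is the text before its first child and
# .tail is the text that follows its closing tag.
parts = [(el.tag, el.text, el.tail) for el in root.iter()]
for tag, text, tail in parts:
    print(tag, repr(text), repr(tail))

# itertext() collects text and tail values in document order.
all_text = list(root.find("p").itertext())
print(all_text)
```

In this sketch 'abcde' appears only as the tail of strong, which is why a loop that reads text alone never sees it, while itertext() picks it up.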
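If the file is indented, as in the question, the element-by-element view changes slightly. This small sketch, again standard library only and with illustrative names (PRETTY, p_text, data_text), assumes the question's layout with newlines between tags and shows that the whitespace belongs to data and to the tail of p, not to the text inside p.

```python
import xml.etree.ElementTree as ET

# The question's document with its original line breaks between tags.
PRETTY = "<data>\n<p>12345<strong>45667</strong>abcde</p>\n</data>"
root = ET.fromstring(PRETTY)

# Text inside <p> is unaffected by the surrounding newlines.
p_text = list(root.find("p").itertext())

# Calling itertext() on the root also yields the newline before <p>
# (data.text) and the newline after </p> (the tail of p).
data_text = list(root.itertext())

print(p_text)
print(data_text)
```

Starting itertext() at p rather than at the root is one way to avoid those whitespace-only strings; filtering with str.strip() would be another.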