Parsing XML with Python - Access Elements
I am using lxml to parse some xml but for some reason I cannot find a specific element.
I am trying to access the <Constant> elements.
Here is the xml snippet:
</rdf:Description>
</rdf:RDF>
</MiriamAnnotation>
<ListOfSubstrates>
<Substrate metabolite="Metabolite_5" stoichiometry="1"/>
</ListOfSubstrates>
<ListOfModifiers>
<Modifier metabolite="Metabolite_9" stoichiometry="1"/>
</ListOfModifiers>
<ListOfConstants>
<Constant key="Parameter_4344" name="Kcat" value="433.724"/>
<Constant key="Parameter_4343" name="km" value="479.617"/>
The code I'm using looks like this:
>>> from lxml import etree as ET
>>> parsed = ET.parse('ct.cps')
>>> root = parsed.getroot()
>>> for a in root.findall(".//Constant"):
...     print(a.attrib['key'])
...
>>> for a in root.findall('Constant'):
...     print(a.get('key'))
...
>>> for a in root.findall('Constant'):
...     print(a.attrib['key'])
...
As you can see, none of these things work.
What am I doing wrong?
EDIT: I'm wondering if this has something to do with the <Constant> elements being empty?
EDIT2: Source xml here: https://www.dropbox.com/s/i6hga7nvmcd6rxx/ct.cps?dl=0
This is how you can get the values you are looking for:
from lxml import etree
parsed = etree.parse('ct.cps')
# Qualify the tag name with the document's default namespace URI in braces.
for a in parsed.findall("//{http://www.copasi.org/static/schema}Constant"):
    print(a.attrib["key"])
Output:
Parameter_4344
Parameter_4343
Parameter_4342
Parameter_4341
Parameter_4340
Parameter_4339
Parameter_4338
Parameter_4337
Parameter_4336
Parameter_4335
Parameter_4334
Parameter_4333
Parameter_4332
Parameter_4331
Parameter_4330
Parameter_4329
Parameter_4328
Parameter_4327
Parameter_4326
Parameter_4325
Parameter_4324
Parameter_4323
Parameter_4322
Parameter_4321
Parameter_4320
Parameter_4319
The important thing is that the root element, COPASI, in your XML file (the real one at the Dropbox URL) declares a default namespace (http://www.copasi.org/static/schema). This means that the element and all of its descendants, including Constant, belong to this namespace.
So instead of Constant elements, you need to search for {http://www.copasi.org/static/schema}Constant elements.
See http://lxml.de/tutorial.html#namespaces .
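To see the effect of a default namespace in isolation, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; lxml's findall accepts the same {uri}tag syntax. The small document is made up for illustration and is not the real ct.cps file.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for ct.cps: the root declares a default namespace,
# so every unprefixed descendant belongs to that namespace too.
XML = """<COPASI xmlns="http://www.copasi.org/static/schema">
  <ListOfConstants>
    <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
    <Constant key="Parameter_4343" name="km" value="479.617"/>
  </ListOfConstants>
</COPASI>"""

NS = "http://www.copasi.org/static/schema"
root = ET.fromstring(XML)

# The bare tag name matches nothing: the real tag is "{NS}Constant".
unqualified = root.findall(".//Constant")

# Qualifying the tag with the namespace URI in braces finds both elements.
qualified = root.findall(".//{%s}Constant" % NS)
keys = [c.get("key") for c in qualified]

print(len(unqualified))
print(keys)
print(root.tag)  # the root tag carries the namespace as well
```

Printing root.tag is a quick way to check whether a document uses a default namespace: if it does, the tag starts with the URI in braces.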
This is how you could have done it using xpath instead of findall:
from lxml import etree
NSMAP = {"c": "http://www.copasi.org/static/schema"}
parsed = etree.parse('ct.cps')
# "c" is an arbitrary prefix; only the URI it maps to has to match the document.
for a in parsed.xpath("//c:Constant", namespaces=NSMAP):
    print(a.attrib["key"])
See http://lxml.de/xpathxslt.html#namespaces-and-prefixes .
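The same prefix-map idea also works with findall, which takes a namespaces argument. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree (lxml's findall takes the same argument), a made-up miniature document, and a hypothetical variable name, values, for the collected result.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of a document with a default namespace.
XML = """<COPASI xmlns="http://www.copasi.org/static/schema">
  <ListOfConstants>
    <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
    <Constant key="Parameter_4343" name="km" value="479.617"/>
  </ListOfConstants>
</COPASI>"""

# The prefix "c" is our own choice; the document itself never uses it.
NSMAP = {"c": "http://www.copasi.org/static/schema"}
root = ET.fromstring(XML)

# Collect name -> numeric value for every Constant in the namespace.
values = {
    c.get("name"): float(c.get("value"))
    for c in root.findall(".//c:Constant", namespaces=NSMAP)
}
print(values)
```

The prefix in the search expression does not have to appear in the XML file. What matters is that it maps to the same namespace URI the document declares.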
First, please ignore my comment. It turns out that lxml.etree is much better than the standard xml.etree.ElementTree, as it takes care of the namespace. The problem is that you want to find '//Constant', which means that the nodes can be at any level. However, the root element does not allow this:
>>> root.findall('//Constant')
SyntaxError: cannot use absolute path on element
However, you can do it at a higher level:
>>> parsed.findall('//Constant')
[<Element Constant at 0x10a7ce128>, <Element Constant at 0x10a7ce170>]
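If you would rather keep searching from the root element, a relative path starting with ".//" is accepted where the absolute "//" is not. A minimal sketch, using the standard library's xml.etree.ElementTree (which raises the same error on an element) and a made-up document without any namespace:

```python
import xml.etree.ElementTree as ET

# Made-up document with no namespaces at all.
XML = """<root>
  <ListOfConstants>
    <Constant key="Parameter_4344"/>
    <Constant key="Parameter_4343"/>
  </ListOfConstants>
</root>"""

root = ET.fromstring(XML)

# An absolute path is rejected when the search starts from an element.
try:
    root.findall("//Constant")
    absolute_error = None
except SyntaxError as exc:
    absolute_error = str(exc)

# A relative path searches all descendants of that element instead.
relative_keys = [c.get("key") for c in root.findall(".//Constant")]

print(absolute_error)
print(relative_keys)
```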
Update
I am posting the full code here. Since I don't have the complete XML file, I made up the missing parts to fill in the gaps.
from io import BytesIO
from lxml import etree as ET

xml_text = """<?xml version='1.0' encoding='utf-8' ?>
<rdf:root xmlns:rdf='http://foo.bar.com/rdf'>
  <rdf:RDF>
    <rdf:Description>
      DescriptionX
    </rdf:Description>
  </rdf:RDF>
  <rdf:foo>
    <MiriamAnnotation>
      bar
    </MiriamAnnotation>
    <ListOfSubstrates>
      <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
    </ListOfSubstrates>
    <ListOfModifiers>
      <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
    </ListOfModifiers>
    <ListOfConstants>
      <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
      <Constant key="Parameter_4343" name="km" value="479.617"/>
    </ListOfConstants>
  </rdf:foo>
</rdf:root>
"""

# Parse from bytes so the encoding declaration in the XML is honoured.
buffer = BytesIO(xml_text.encode('utf-8'))
tree = ET.parse(buffer)

# Search from the tree, not from the root element.
for constant_node in tree.findall('//Constant'):
    print(constant_node.attrib['key'])
Do not use findall. It has a limited feature set and is designed to be compatible with ElementTree.
Instead, use xpath, which supports namespaces. From the above, it sounds like you probably want to say something like:
# possibilities, you need to get these right...
ns_dict = {'atom': "http://www.w3.org/2005/Atom",
           'rdf': "http://www.w3.org/2000/01/rdf-schema#"}
root = parsed.getroot()
for a in root.xpath('.//rdf:Constant', namespaces=ns_dict):
    print(a.attrib['key'])
Note that you must include a namespace prefix in the xpath expression whenever an element has a non-empty namespace, and each prefix must map to a namespace URL that matches the one used in your document.
Update
Since you posted your original document, I can see that there is no namespace assigned to the items you are looking for. This will work, I just tried it with the original document:
for a in tree.xpath("//Constant"):
    print(a.attrib['key'])
You don't need a namespace as there is no default namespace specified in the document itself.