Parsing XML with Python - Access Elements

I am using lxml to parse some xml but for some reason I cannot find a specific element.

I am trying to access the <Constant> elements.

Here is the xml snippet:

  </rdf:Description>
</rdf:RDF>
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>

      

The code I'm using looks like this:

    >>> from lxml import etree as ET
    >>> parsed = ET.parse('ct.cps')
    >>> root = parsed.getroot()    
    >>> for a in root.findall(".//Constant"):
    ...     print a.attrib['key']
    ... 
    >>> for a in root.findall('Constant'):
    ...     print a.get('key')
    ... 
    >>> for a in root.findall('Constant'):
    ...     print a.attrib['key']
    ... 

      

As you can see, none of these things work.

What am I doing wrong?


EDIT: I'm wondering if this has something to do with the <Constant> items being empty?


EDIT2: Source xml here: https://www.dropbox.com/s/i6hga7nvmcd6rxx/ct.cps?dl=0

python xml python-2.7 lxml




3 answers


This is how you can get the values you are looking for:

from lxml import etree

parsed = etree.parse('ct.cps')

for a in parsed.findall("//{http://www.copasi.org/static/schema}Constant"):
    print a.attrib["key"]

      

Output:

Parameter_4344
Parameter_4343
Parameter_4342
Parameter_4341
Parameter_4340
Parameter_4339
Parameter_4338
Parameter_4337
Parameter_4336
Parameter_4335
Parameter_4334
Parameter_4333
Parameter_4332
Parameter_4331
Parameter_4330
Parameter_4329
Parameter_4328
Parameter_4327
Parameter_4326
Parameter_4325
Parameter_4324
Parameter_4323
Parameter_4322
Parameter_4321
Parameter_4320
Parameter_4319

      

The important thing is that the root element COPASI in your XML file (the real one at the Dropbox URL) declares a default namespace (http://www.copasi.org/static/schema). This means that the element and all of its descendants, including Constant, belong to this namespace.

So instead of Constant elements, you need to search for {http://www.copasi.org/static/schema}Constant elements.



See http://lxml.de/tutorial.html#namespaces .
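A minimal sketch of the point above, using a made-up two-element document rather than the real ct.cps file. It is written for Python 3 and uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; the findall calls shown take the same arguments in lxml.etree.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for ct.cps: the root declares a default namespace.
XML = """<COPASI xmlns="http://www.copasi.org/static/schema">
  <ListOfConstants>
    <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
    <Constant key="Parameter_4343" name="km" value="479.617"/>
  </ListOfConstants>
</COPASI>"""

root = ET.fromstring(XML)

# The unqualified name matches nothing, because every element is namespaced.
plain = root.findall(".//Constant")

# The qualified name, in {uri}tag form, matches both elements.
qualified = root.findall(".//{http://www.copasi.org/static/schema}Constant")
keys = [c.get("key") for c in qualified]

print(len(plain), keys)
```

The empty result for the unqualified search is the same symptom as in the question: no error, just no matches.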


This is how you could have done it using XPath instead of findall:

from lxml import etree

NSMAP = {"c": "http://www.copasi.org/static/schema"}

parsed = etree.parse('ct.cps')

for a in parsed.xpath("//c:Constant", namespaces=NSMAP):
    print a.attrib["key"]

      

See http://lxml.de/xpathxslt.html#namespaces-and-prefixes .
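If you would rather stay with findall, it also accepts a prefix map through its namespaces argument. The sketch below is an illustration only: it uses Python 3, the standard library's xml.etree.ElementTree (lxml.etree takes the same argument), and a made-up document in place of ct.cps. The prefix "c" is arbitrary; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# The prefix is a local alias; the URI must match the one in the document.
NSMAP = {"c": "http://www.copasi.org/static/schema"}

# Hypothetical stand-in for ct.cps.
XML = """<COPASI xmlns="http://www.copasi.org/static/schema">
  <ListOfConstants>
    <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
    <Constant key="Parameter_4343" name="km" value="479.617"/>
  </ListOfConstants>
</COPASI>"""

root = ET.fromstring(XML)

# "c:Constant" is expanded to "{http://www.copasi.org/static/schema}Constant".
keys = [c.attrib["key"] for c in root.findall(".//c:Constant", namespaces=NSMAP)]
print(keys)
```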



First, please ignore my comment. It turns out that lxml.etree is much better than the standard xml.etree.ElementTree, as it takes care of the namespace. The problem is that you want to find '//Constant', which means that the nodes can be at any level. However, the root element does not allow this:

>>> root.findall('//Constant')
SyntaxError: cannot use absolute path on element

      

However, you can do it at a higher level:

>>> parsed.findall('//Constant')
[<Element Constant at 0x10a7ce128>, <Element Constant at 0x10a7ce170>]
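As a rough illustration of the difference between an absolute and a relative path, here is a sketch in Python 3 using the standard library's xml.etree.ElementTree, which rejects absolute paths on an element in the same way. The document is made up and has no namespace. A path starting with ".//" searches all descendants of the element it is called on, so it works on the root element too.

```python
import xml.etree.ElementTree as ET

# Hypothetical document without any namespace declaration.
XML = """<root>
  <ListOfConstants>
    <Constant key="Parameter_4344"/>
    <Constant key="Parameter_4343"/>
  </ListOfConstants>
</root>"""

root = ET.fromstring(XML)

# An absolute path ("//...") is rejected when called on an element.
try:
    root.findall("//Constant")
    absolute_ok = True
except SyntaxError:
    absolute_ok = False

# A relative path (".//...") searches every level below the element.
keys = [c.get("key") for c in root.findall(".//Constant")]
print(absolute_ok, keys)
```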

      



Update

I am posting the full code here. Since I don't have the complete XML file, I made up the missing parts to fill in the gaps.

from lxml import etree as ET
from StringIO import StringIO

xml_text = """<?xml version='1.0' encoding='utf-8' ?>

<rdf:root  xmlns:rdf='http://foo.bar.com/rdf'>
<rdf:RDF>
  <rdf:Description>
    DescriptionX
  </rdf:Description>
</rdf:RDF>
<rdf:foo>
        <MiriamAnnotation>
          bar
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>
        </ListOfConstants>
</rdf:foo>
</rdf:root>
"""

buffer = StringIO(xml_text)
tree = ET.parse(buffer)
for constant_node in tree.findall('//Constant'):
    print constant_node.attrib['key']

      



Do not use findall. It has a limited feature set and is designed to be compatible with ElementTree.

Instead, use xpath, which supports namespaces. From the above, it sounds like you probably want to say something like:

# possibilities, you need to get these right...
ns_dict = {"atom": "http://www.w3.org/2005/Atom",
           "rdf": "http://www.w3.org/2000/01/rdf-schema#"}

root = parsed.getroot()    
for a in root.xpath('.//rdf:Constant', namespaces=ns_dict):
    print a.attrib['key']

      

Note that you must include a namespace prefix in the xpath expression whenever an element has a non-empty namespace, and each prefix must map to a namespace URL that exactly matches the one used in your document.

Update

Since you posted your original document, I can see that there is no namespace assigned to the items you are looking for. This will work, I just tried it with the original document:

for a in tree.xpath("//Constant"):
    print a.attrib['key']

      

You don't need a namespace as there is no default namespace specified in the document itself.
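Whether a prefix is needed depends on whether the document actually declares a namespace, and that is easy to check before choosing a query. The sketch below is one possible way to do it in Python 3, assuming the standard library's xml.etree.ElementTree; namespace_of is a hypothetical helper, and both documents are made up. A namespaced element has a tag of the form {uri}name, so the URI can be read straight off the tag.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI of an element's tag, or None if it has none."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None


# Two hypothetical documents: one with a default namespace, one without.
with_ns = ET.fromstring(
    '<COPASI xmlns="http://www.copasi.org/static/schema"><Constant/></COPASI>'
)
without_ns = ET.fromstring("<COPASI><Constant/></COPASI>")

# Child elements inherit the default namespace declared on the root.
ns_child = namespace_of(with_ns[0])
no_ns_child = namespace_of(without_ns[0])
print(ns_child, no_ns_child)
```

If the helper returns a URI for the elements you are after, the query needs that URI (or a prefix mapped to it); if it returns None, the plain tag name is enough.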
