Parsing an XML file using an ordered dictionary

I have an XML file of the form:

<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>

      

I need to process it so that, for example, when the user enters nd, the program matches it against the <Phonetic> tag and returns and from the corresponding <Phonemic> tag. I thought that if I could convert the XML file to a dictionary, I could loop over the data and find the information as needed.

I searched and found xmltodict, which is meant for exactly this purpose:

import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())

      

Running this gives me an OrderedDict:

>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])

      

Unfortunately, this did not make things easier, and I am not sure how to start implementing the program with the new data structure. For example, to access nd I need to write:

obj['NewDataSet']['Root'][0]['Phonetic']

      

which is ridiculously cumbersome. I tried converting it to a regular dictionary with dict(), but since it is nested, the inner layers remain OrderedDicts, and my data is very large.


2 answers


If you access it like obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right.

Instead, you can do the following:

obj = obj["NewDataSet"]
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# The step above ensures that root_elements is always a list:
# xmltodict returns a single dict when there is only one <Root> element
for element in root_elements:
    print(element["Phonetic"])

      

Although this code looks longer, the advantage is that it stays compact and modular once you start working with a large enough XML file.
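Once the Root entries are normalized to a list, the lookup the question asks for (enter nd, get and) could be sketched roughly as follows. This is only a sketch under assumptions: the parsed literal stands in for the result of xmltodict.parse on the sample file, and as_list and build_index are hypothetical helper names, not part of xmltodict.

```python
# Stand-in for xmltodict.parse(...) on the sample file from the question
# (assumption: plain dicts behave like OrderedDicts for this purpose).
parsed = {
    "NewDataSet": {
        "Root": [
            {"Phonemic": "and", "Phonetic": "nd", "Description": None,
             "Start": "0", "End": "8262"},
            {"Phonemic": "comfortable", "Phonetic": "comfetebl",
             "Description": "adj", "Start": "61404", "End": "72624"},
        ]
    }
}


def as_list(value):
    """Return value unchanged if it is a list, else wrap it in a list."""
    return value if isinstance(value, list) else [value]


def build_index(parsed_doc):
    """Map each Phonetic string to its Phonemic string (hypothetical helper)."""
    roots = as_list(parsed_doc["NewDataSet"]["Root"])
    return {root["Phonetic"]: root["Phonemic"] for root in roots}


index = build_index(parsed)
print(index.get("nd"))       # and
print(index.get("missing"))  # None
```

Building the index once and then using index.get(user_input) avoids looping over the whole list for every query, which matters when the file is large. If two entries share the same Phonetic value, the later one overwrites the earlier one in this sketch.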



PS: I had the same problems with xmltodict. But compared with using xml.etree.ElementTree to parse the XML files, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with the other quirks of the xml module.
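For comparison, the same lookup with the standard-library xml.etree.ElementTree mentioned above might look roughly like this. It is a sketch, not a recommendation over xmltodict; find_phonemic is a hypothetical helper name, and the XML is the sample from the question.

```python
import xml.etree.ElementTree as ET

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

# The top-level <NewDataSet> element
dataset = ET.fromstring(xmldata)


def find_phonemic(dataset, phonetic):
    """Return the Phonemic text of the first <Root> whose Phonetic matches."""
    for entry in dataset.findall("Root"):
        if entry.findtext("Phonetic") == phonetic:
            return entry.findtext("Phonemic")
    return None


print(find_phonemic(dataset, "nd"))  # and
```

Here findall("Root") always returns a list of the direct <Root> children, so the single-element special case does not arise.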

EDIT

The following code works for me:

import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# The step above ensures that root_elements is always a list:
# xmltodict returns a single dict when there is only one <Root> element
for element in root_elements:
    print(element["Phonetic"])

      



Mu's answer worked for me. The one tricky step is ensuring that root_elements is always a list, which means checking obj["Root"] itself rather than obj:



import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# The step above ensures that root_elements is always a list.
# If obj["Root"] is already a list, use it as is; otherwise wrap it in a single-element list.
for element in root_elements:
    print(element["Phonetic"])
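On the question's side point that dict() only converts the outermost layer: a recursive conversion is one way to get plain dicts all the way down. The sketch below assumes the nested OrderedDict/list shape shown in the question; to_plain is a hypothetical helper name, and the nested literal is a shortened stand-in for the parsed data.

```python
from collections import OrderedDict


def to_plain(value):
    """Recursively convert dict subclasses (such as OrderedDict) to plain dicts."""
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    return value  # strings and None are returned unchanged


# Shortened stand-in for the structure shown in the question
nested = OrderedDict([
    ("NewDataSet", OrderedDict([
        ("Root", [
            OrderedDict([("Phonemic", "and"), ("Phonetic", "nd")]),
        ]),
    ])),
])

plain = to_plain(nested)
print(type(plain["NewDataSet"]["Root"][0]) is dict)  # True
```

The conversion is not needed for the lookup itself, since an OrderedDict supports the same key access as a dict; it mainly makes the printed structure easier to read.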

      
