Parsing an XML file using an ordered dictionary
I have an XML file of this form:
<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>
I need to process it so that, for example, when the user enters nd, the program matches it against the <Phonetic> tag and returns and from the <Phonemic> tag. I thought that if I could convert the XML file to a dictionary, I could loop through the data and find the information I need.
I searched and found xmltodict which is used for the same purpose:
import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())
Running this gives me an OrderedDict:
>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])
Unfortunately, this did not make things easier, and I am not sure how to start implementing the program with the new data structure. For example, to access nd I need to write:
obj['NewDataSet']['Root'][0]['Phonetic']
which is ridiculously cumbersome. I tried converting it to a regular dictionary with dict(), but since it is nested, the inner layers remain OrderedDicts, and my data is very large.
If you refer to it like obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right.
Instead, you can do the following:
obj = obj["NewDataSet"]
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]]
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])
Although this code looks a lot longer, the advantage is that it stays much more compact and modular once you start working with a large enough XML file.
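To get from that loop to the lookup the question asks for (enter nd, get and), one option is to build a plain dict keyed by the Phonetic text once and then query it. This is only a sketch: build_lookup is a made-up helper name, and the sample list below is typed in by hand to mirror the parsed output shown in the question rather than produced by xmltodict.

```python
def build_lookup(root_elements, key="Phonetic", value="Phonemic"):
    """Map each element's `key` text to its `value` text."""
    lookup = {}
    for element in root_elements:
        # setdefault keeps the first entry if the same key occurs more than once
        lookup.setdefault(element[key], element[value])
    return lookup


# Hand-written sample shaped like the parsed <Root> entries from the question
root_elements = [
    {"Phonemic": "and", "Phonetic": "nd", "Description": None,
     "Start": "0", "End": "8262"},
    {"Phonemic": "comfortable", "Phonetic": "comfetebl", "Description": "adj",
     "Start": "61404", "End": "72624"},
]

phonetic_to_phonemic = build_lookup(root_elements)
print(phonetic_to_phonemic.get("nd"))  # prints: and
```

Building the dict once makes each later lookup a single dictionary access instead of a scan over all entries, which may matter if the file is large.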
PS: I had the same concerns about xmltodict. But compared with using xml.etree.ElementTree to parse the XML files, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with the other quirks of the xml module.
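For comparison, the same lookup written against the standard library's xml.etree.ElementTree could look roughly like this. It is a sketch under assumptions: find_phonemic is a hypothetical function name, and it assumes every <Root> entry carries <Phonetic> and <Phonemic> children, as in the sample from the question.

```python
import xml.etree.ElementTree as ET

xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""


def find_phonemic(xml_text, phonetic):
    """Return the Phonemic text of the first entry whose Phonetic text matches."""
    root = ET.fromstring(xml_text)  # the <NewDataSet> element
    for entry in root.iter("Root"):
        # findtext returns None when the child tag is absent
        if entry.findtext("Phonetic") == phonetic:
            return entry.findtext("Phonemic")
    return None  # no entry matched


print(find_phonemic(xmldata, "nd"))  # prints: and
```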
EDIT
The following code works for me:
import xmltodict
from collections import OrderedDict
xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""
obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]]
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])
Mu's answer worked for me; the only thing I had to change was the tricky step that ensures root_elements is always a list. It has to check the type of obj["Root"], not of obj:
import xmltodict
from collections import OrderedDict
xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""
obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# Above step ensures that root_elements is always a list:
# if obj["Root"] is already a list, use it as is; otherwise wrap it in a single-element list.
for element in root_elements:
    print(element["Phonetic"])
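If this check is needed in more than one place, it can be pulled into a small helper. The sketch below uses a made-up name, as_list, and hand-written dicts standing in for parsed data with one and with several <Root> entries. Treating a missing key as an empty list is an assumption added here, not something the code above does.

```python
def as_list(value):
    """Return [] for None, the value itself if it is a list, else [value]."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hand-written stand-ins for parsed data with one and with several <Root> entries
single = {"Root": {"Phonemic": "and", "Phonetic": "nd"}}
several = {"Root": [{"Phonetic": "nd"}, {"Phonetic": "comfetebl"}]}

for parsed in (single, several):
    for element in as_list(parsed.get("Root")):
        print(element["Phonetic"])
```

As far as I know, xmltodict.parse also accepts a force_list argument that can make specific tags always come back as lists, which may remove the need for this step. Check the documentation of your installed version before relying on it.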