Read hierarchical (tree) XML into pandas framework while preserving hierarchy
I have an XML document that contains a hierarchical tree structure, see example below.
The document contains several tags <Message>
(I just copied one of them for convenience).
Everyone <Message>
has some associated data ( id
, status
, priority
) on their own.
Additionally, each <Message>
can contain one or more child elements <Street>
, which again have some relevant data ( <name>
, <length>
).
Also, everyone <Street>
can have one or more children <Link>
, which again have their respective data ( <id>
, <direction>
).
Sample XML Document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'>
<status>Active</status>
<priority>Low</priority>
<Area>
<Streets>
<Street>
<name>King Street</name>
<length>Short</length>
<Link>
<id>75838745</id>
<direction>North</direction>
</Link>
<Link>
<id>168745</id>
<direction>South</direction>
</Link>
<Link>
<id>975416</id>
<direction>North</direction>
</Link>
</Street>
<Street>
<name>Queen Street</name>
<length>Long</length>
<Link>
<id>366248</id>
<direction>West</direction>
</Link>
<Link>
<id>745812</id>
<direction>East</direction>
</Link>
</Street>
</Streets>
</Area>
</Message>
</Messages>
</Root>
Parsing XML with Python and storing the relevant data in variables is not a problem - I can use, for example, a library lxml
and either read the whole document, or execute some expressions xpath
to get the corresponding fields, or read it line by line using the iterparse
.
However, I would like to put the data in a pandas framework while keeping the hierarchy in it. The goal is to query for individual messages (for example, booleans like if status == Active then get the Message with all its streets and its streets' links
) and get all the data related to a particular message (its streets and links to its streets). What is the best way to do this?
I've tried different approaches but ran into all the problems.
If I create one dataframe row for each XML row that contains information, and then set MultiIndex to [MessageID, StreetName, LinkID]
, I end up with an index with a lot NaN
in it (which is usually not recommended) as I MessageID
don't know your children streets
and links
. Also, I would not know how to select some sub-dataset by boolean state instead of only getting individual rows without its children.
When you do GroupBy on [MessageID, StreetName, LinkID]
, I don't know how to return (like MultiIndex) the dataframe from the pandas GroupBy object, since there is nothing to do here (no / std / sum / whatsoever value, the values ββmust remain the same).
Any suggestions how this could be handled efficiently?
source to share
I finally managed to solve the problem as described above and here's how to do it.
I have expanded the above XML document to include two messages instead of one. This is how it looks like a valid Python string (it can also be loaded from a file):
xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \
<status>Active</status> \
<priority>Low</priority> \
<Area> \
<Streets> \
<Street> \
<name>King Street</name> \
<length>Short</length> \
<Link> \
<id>75838745</id> \
<direction>North</direction> \
</Link> \
<Link> \
<id>168745</id> \
<direction>South</direction> \
</Link> \
<Link> \
<id>975416</id> \
<direction>North</direction> \
</Link> \
</Street> \
<Street> \
<name>Queen Street</name> \
<length>Long</length> \
<Link> \
<id>366248</id> \
<direction>West</direction> \
</Link> \
<Link> \
<id>745812</id> \
<direction>East</direction> \
</Link> \
</Street> \
</Streets> \
</Area> \
</Message> \
<Message id='54321'> \
<status>Inactive</status> \
<priority>High</priority> \
<Area> \
<Streets> \
<Street> \
<name>Princess Street</name> \
<length>Mid</length> \
<Link> \
<id>744154</id> \
<direction>West</direction> \
</Link> \
<Link> \
<id>632214</id> \
<direction>South</direction> \
</Link> \
<Link> \
<id>654785</id> \
<direction>East</direction> \
</Link> \
</Street> \
<Street> \
<name>Prince Street</name> \
<length>Very Long</length> \
<Link> \
<id>1022444</id> \
<direction>North</direction> \
</Link> \
<Link> \
<id>4474558</id> \
<direction>South</direction> \
</Link> \
</Street> \
</Streets> \
</Area> \
</Message> \
</Messages> \
</Root>'''
To parse the XML hierarchical structure into a flat pandas framework, I used the Python ElementTree method iterparse
, which provides a SAX-like interface to iterate through the XML document in turn and fire events if certain XML tags are triggered or end.
For each parsed XML string, this information is stored in a dictionary. There are three dictionaries in use: one for each dataset that belongs to each other in some way (message, street, link), and this must be stored in its own dataframe string later. When all the information in one such line is collected, the dictionary is added to the list, which stores all the lines in the appropriate order.
This is what parsing XML looks like (see inline comments for further explanation):
# imports
import xml.etree.ElementTree as ET
import pandas as pd
# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))
# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}
# initialize list that stores the single dataframe rows
listOfRows = []
# read the xml file line by line and throw signal when specific tags start or end
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):
##########
# get all information on the current message and store in the appropriate dictionary
##########
# get current message id attribute
if event == 'start' and element.tag == 'Message':
messageRow = {} # re-initialize the dictionary for the current row
messageRow['messageId'] = element.get('id')
# get current message status
if event == 'end' and element.tag == 'status':
messageRow['status'] = element.text
# get current message priority
if event == 'end' and element.tag == 'priority':
messageRow['priority'] = element.text
# when no more information on the current message is expected, append it to the list of rows
if event == 'end' and element.tag == 'priority':
listOfRows.append(messageRow)
##########
# get all information on the current street and store in row dictionary
##########
if event == 'end' and element.tag == 'name':
streetRow = {} # re-initialize the dictionary for the current street row
streetRow['streetName'] = element.text
if event == 'end' and element.tag == 'length':
streetRow['streetLength'] = element.text
# when no more information on the current street is expected, append it to the list of rows
if event == 'end' and element.tag == 'length':
# link the street to the message it belongs to, then append
streetRow['messageId'] = messageRow['messageId']
listOfRows.append(streetRow)
##########
# get all information on the current link and store in row dictionary
##########
if event == 'end' and element.tag == 'id':
linkRow = {} # re-initialize the dictionary for the current link row
linkRow['linkId'] = element.text
if event == 'end' and element.tag == 'direction':
linkRow['direction'] = element.text
# when no more information on the current link is expected, append it to the list of rows
if event == 'end' and element.tag == 'direction':
# link the link to the message it belongs to, then append
linkRow['messageId'] = messageRow['messageId']
listOfRows.append(linkRow)
listOfRows
is now a list of dictionaries, in which each dictionary stores information to be placed on one line of the data frame. Creating a DataFrame with this list as the data source can be done with
# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)
and gives the "raw" dataframe:
messageId status priority streetName streetLength linkId \
0 12345 Active Low NaN NaN NaN
1 12345 NaN NaN King Street Short NaN
2 12345 NaN NaN NaN NaN 75838745
3 12345 NaN NaN NaN NaN 168745
4 12345 NaN NaN NaN NaN 975416
5 12345 NaN NaN Queen Street Long NaN
6 12345 NaN NaN NaN NaN 366248
7 12345 NaN NaN NaN NaN 745812
8 54321 Inactive High NaN NaN NaN
9 54321 NaN NaN Princess Street Mid NaN
10 54321 NaN NaN NaN NaN 744154
11 54321 NaN NaN NaN NaN 632214
12 54321 NaN NaN NaN NaN 654785
13 54321 NaN NaN Prince Street Very Long NaN
14 54321 NaN NaN NaN NaN 1022444
15 54321 NaN NaN NaN NaN 4474558
direction
0 NaN
1 NaN
2 North
3 South
4 North
5 NaN
6 West
7 East
8 NaN
9 NaN
10 West
11 South
12 East
13 NaN
14 North
15 South
We can now represent the columns of interest (messageId, streetName, linkId) as MultiIndex on this dataframe:
# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)
which gives:
status priority streetLength direction
messageId streetName linkId
12345 NaN NaN Active Low NaN NaN
King Street NaN NaN NaN Short NaN
NaN 75838745 NaN NaN NaN North
168745 NaN NaN NaN South
975416 NaN NaN NaN North
Queen Street NaN NaN NaN Long NaN
NaN 366248 NaN NaN NaN West
745812 NaN NaN NaN East
54321 NaN NaN Inactive High NaN NaN
Princess Street NaN NaN NaN Mid NaN
NaN 744154 NaN NaN NaN West
632214 NaN NaN NaN South
654785 NaN NaN NaN East
Prince Street NaN NaN NaN Very Long NaN
NaN 1022444 NaN NaN NaN North
4474558 NaN NaN NaN South
Even though the overall index NaN
is not counted in the index, I don't see any problem with this for this utility.
Finally, to get the desired effect of accessing single posts using messageId
, including all of its "child" streets and links, the MultiIndexed framework must be grouped by the outermost index level:
# group by the most outer index
groups = df.groupby(level='messageId')
Now you can, for example, bypass all messages (and do everything with them) with
# iterate over all groups
for key, group in groups:
print('key: ' + key)
print('group:')
print(group)
print('\n')
which returns
key: 12345
group:
status priority streetLength direction
messageId streetName linkId
12345 NaN NaN Active Low NaN NaN
King Street NaN NaN NaN Short NaN
NaN 75838745 NaN NaN NaN North
168745 NaN NaN NaN South
975416 NaN NaN NaN North
Queen Street NaN NaN NaN Long NaN
NaN 366248 NaN NaN NaN West
745812 NaN NaN NaN East
key: 54321
group:
status priority streetLength direction
messageId streetName linkId
54321 NaN NaN Inactive High NaN NaN
Princess Street NaN NaN NaN Mid NaN
NaN 744154 NaN NaN NaN West
632214 NaN NaN NaN South
654785 NaN NaN NaN East
Prince Street NaN NaN NaN Very Long NaN
NaN 1022444 NaN NaN NaN North
4474558 NaN NaN NaN South
or you can access specific messages messageId by returning a string containing the messageId, along with all highlighted streets and links:
# get groups by key
print('specific group only:')
print(groups.get_group('54321'))
gives
specific group only:
status priority streetLength direction
messageId streetName linkId
54321 NaN NaN Inactive High NaN NaN
Princess Street NaN NaN NaN Mid NaN
NaN 744154 NaN NaN NaN West
632214 NaN NaN NaN South
654785 NaN NaN NaN East
Prince Street NaN NaN NaN Very Long NaN
NaN 1022444 NaN NaN NaN North
4474558 NaN NaN NaN South
Hope this is helpful to someone someday.
source to share