Element Tree iter () skips random elements

I am trying to parse an XML file using the Element Tree functions iterparse () and iter () in Python. Here is a link to the file on Google Drive: https://drive.google.com/file/d/0B_S2Z7quow3TMl9yUk51ZzZ5UW8/view?usp=sharing .

XML file is a collection of data on court cases; it is broken down into a series of "n-document" elements, each of which contains subelements containing court-specific data. I am trying to retrieve all descriptions of descriptions. Below is a simplified version of the code:

import numpy as np
import pandas as pd
import xml.etree.ElementTree as etree
import re
import csv

for event, elem in etree.iterparse("***fileName***", events=("start", "end")):
    if event == "start":
        if elem.tag == "docket.entry":
            for element in elem.iter():
                print element.tag
                if element.text != None:
                    print element.text
                if element.tail != None:
                    print element.tail
                    print "from tail"
    elem.clear()

      

The problem is that in the first case (1613 HARVARD LIMITED PARTNERSHIP V. DISTRICT OF COLUMBIA ET AL) the document description numbered 25 (they are numbered in descending order) skips the text and the tail of the element tagged "gateway.image.link". Specifically, here is the result I am getting. I just canceled the build after a second and scrolled to the very top of the console:

docket.entry
number.block
number
28
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)
docket.entry
number.block
number
27
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)
docket.entry
number.block
number
26
image.block
image.gateway.link
gateway.image.link
date
03/31/2007
docket.description
MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION
image.gateway.link
21
gateway.image.link
21
 TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)
from tail
docket.entry
number.block
number
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
docket.entry
number.block
number
24
image.block
image.gateway.link
gateway.image.link
date
11/14/2005
docket.description
NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #
image.gateway.link
1
gateway.image.link
1
)(MULLEN, MARTHA) (ENTERED: 11/14/2005)
from tail

      

Input number 25 (second from the bottom shown above) says:

25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link

      

The problem is that if you look at the XML file itself, you will see that there is also an element with the tag "gateway.image.link", which immediately follows "image.gateway.link" with the text and content of the tail, but by for some reason the iter () function doesn't pick it up. The weird thing is that most of the other descriptions in the docket also have elements with the "image.gateway.link" tag, immediately followed by the element with the "gateway.image.link" tag, as you can see from post number 24 (and the rest of them) , and the iter () function recognizes those, but not this one. Here's the torn XML from the Google Drive doc, I pasted the link to the above:

<?xml version="1.0" encoding="UTF-8" ?><n-extract-response>
<docket.entries.block><label>Entry #:</label><label>Date:</label><label>Description:</label><docket.entry><number.block><number>28</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A1-280450912204" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>ORDER GRANTING DEFENDANTS&apos; MOTION TO DISMISS AND DENYING PLAINTIFF&apos;S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>27</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A2-2704501909813" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>26</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A4-2604501672579" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>03/31/2007</date><docket.description>MEMORANDUM ORDER GRANTING DEFENDANTS&apos; MOTION<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">21</image.gateway.link><gateway.image.link ID="B3-21-0450561212" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">21</gateway.image.link> TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS&apos; DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)</docket.description></docket.entry><docket.entry><number.block><number>25</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A6-2504501577842" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/15/2005</date><docket.description>RESPONSE TO DEFENDANTS&apos; NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B5-1-04511581037" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link> EXHIBIT 1 - NOTICE OF APPEAL)(WISE, RICHARD) (ENTERED: 11/15/2005)</docket.description></docket.entry><docket.entry><number.block><number>24</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A8-2404501579104" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/14/2005</date><docket.description>NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B7-1-04511577643" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link>)(MULLEN, MARTHA) (ENTERED: 11/14/2005)</docket.description></docket.entry></docket.entries.block>
</n-extract-response>

      

When I run my Python script on that particular excerpt exactly like pasted above, it gets the missing element. But when I run the script on the whole XML file, it is not as shown above. Obviously the excerpt is missing a lot of elements above and below, but I don't see how this would affect the iter () function since I didn't break the "docket.entry" element / subelements and that the for loop in my code has to go through every time ( I think).

The problem is not limited to record number 25 - there are several other extracted dict descriptions here and there where the subelement is missing, but I can't distinguish any pattern - I can't even tell the difference between record number 25 and record number 24, which is causing the problem. Can anyone please help?

+3


source to share


3 answers


You are trying to handle the child elements in the start event, but the way iterparse

works, it doesn't guarantee that they have already been read.

There is a note in the documentation about this:

Note:

iterparse () only makes sure it saw the ">" character of the start tag when it fires the "start" event, so the attributes are defined, but the content of the text and tail attributes is undefined at this clause. The same goes for child elements; they may or may not be present.

If you want a completely filled element, search for "end" instead.



If you want to handle child elements, you need to do so at end-event, otherwise there is no guarantee of what content on that element will be available.

The reason why you get any content at all is described here :

Note:

The tree constructor and event generator are not necessarily synchronized; the latter usually lags slightly behind. This means that when you receive a "start" event for an element, the builder may already be filling that element with content. You cannot rely on this, though - the "start" event can only be used to validate attributes, not element content. See this post for more information .

+1


source


getchildren is deprecated since 2.7: use list (elem) or iteration.



0


source


Perhaps you can choose to parse the xml file according to its logical order so that you can control each element precisely. For example.

import xml.etree.ElementTree as ET

tree = ET.parse(r'<xml file name>')
root = tree.getroot()
docket_entries = root.findall('.//docket.entry')
for entry in docket_entries:
    number = entry.find('.//number')
    print number.text
    description = entry.find('docket.description')
    print description.text
    for child in description.getchildren():
        print child

      

-1


source







All Articles