Why is elementtree.ElementTree.iterparse using so much memory?

I am using elementtree.ElementTree.iterparse to parse a large (371MB) XML file.

My code is basically like this:

from elementtree.ElementTree import iterparse

outf = open('out.txt', 'w')
context = iterparse('copyright.xml')
context = iter(context)
dummy, root = context.next()

for event, elem in context:
    if elem.tag == 'foo':
        author = elem.text

    elif elem.tag == 'bar':
        if elem.text is not None and 'bat' in elem.text.lower():
            outf.write(elem.text + '\n')
    elem.clear()   #line A
    root.clear()   #line B

      

My question is twofold:

First, do I need both A and B (see the comments in the code snippet)? I was told that root.clear() cleans up unneeded children so they don't gobble up memory, but here are my observations: using B but not A is the same as using neither in terms of memory consumption (as observed in Task Manager). Using only A seems to be the same as using both.

Second, why is it still consuming so much memory? By the time the program finishes, it is using about 100MB of RAM.

I'm guessing it has something to do with outf, but why? Isn't that just writing to disk? And if it stores this data until it is closed, how can I avoid it?

Additional info: I am using Python 2.7.3 on Windows.



2 answers


(The code as posted, with its second line indented, shouldn't run.) http://bugs.python.org/issue14762 describes a similar problem, and the answer there is that you have to clear every element (line A). Without seeing what outf is (or the code that created it), it's hard to answer the second question. If it were a StringIO object, the answer would be obvious. You can take a look at the tutorial linked in the second message of that tracker issue:



http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/
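
As a rough illustration of clearing every element, here is a minimal sketch of the loop from the question. It is not the asker's exact code: it assumes the standard-library xml.etree.ElementTree rather than the standalone elementtree package, uses next(context) so it runs on both Python 2.7 and Python 3, and requests 'start' events so that the first event really delivers the root element. The helper name matching_bar_texts and the sample document are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def matching_bar_texts(source, needle='bat'):
    """Yield the text of each <bar> element whose text contains needle.

    Hypothetical helper for illustration; the tag name and needle come
    from the question's snippet.
    """
    # Ask for 'start' events too, so the first event carries the root element.
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)

    for event, elem in context:
        if event != 'end':
            continue  # an element's text is only complete at its 'end' event
        if elem.tag == 'bar' and elem.text and needle in elem.text.lower():
            yield elem.text
        elem.clear()  # line A: drop this element's text, attributes, children
        root.clear()  # line B: drop the root's references to finished children


# Tiny in-memory document standing in for the large file (assumed data).
sample = io.BytesIO(b'<root><foo>Ann</foo><bar>Wombat</bar><bar>cat</bar></root>')
found = list(matching_bar_texts(sample))
```

Because the helper is a generator, the caller can write each match to the output file as it arrives instead of collecting everything in memory first.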



Use xml.etree.cElementTree.iterparse() [in Python 2.x] instead.



Life is too short to debug other people's mistakes.
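
A minimal sketch of that switch, assuming the rest of the script only needs iterparse. The fallback branch is an addition for interpreters that do not provide a separate cElementTree module (as is the case in recent Python 3 releases); the in-memory sample document is made up for the example.

```python
import io

try:
    # Python 2.x: the C-accelerated implementation lives in its own module.
    import xml.etree.cElementTree as ET
except ImportError:
    # Fallback (assumption): interpreters without a separate cElementTree.
    import xml.etree.ElementTree as ET

# With the default events, iterparse reports each element when it ends.
events = ET.iterparse(io.BytesIO(b'<root><bar>bat</bar></root>'))
tags = [elem.tag for _, elem in events]
```

The rest of the loop stays the same, since both modules expose iterparse with the same call signature.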
