xml.etree.ElementTree iterparse() still using a lot of memory?

I have been experimenting with iterparse to reduce the memory footprint of my scripts that process large XML documents. Here's an example. I wrote this simple script to read a TMX file and split it into one or more output files, none of which exceeds a user-specified size. Despite using iterparse, when I split an 886 MB file into 100 MB files, the script runs away with all available memory (grinding to a crawl once it is using 6.5 GB of my 8 GB).

Am I doing something wrong? Why is memory usage going so high?

#! /usr/bin/python
# -*- coding: utf-8 -*-
import argparse
import codecs
from xml.etree.ElementTree import iterparse, tostring
from sys import getsizeof

def startNewOutfile(infile, i, roottxt, headertxt):
    out = open(infile.replace('tmx', str(i) + '.tmx'), 'w')
    print >>out, '<?xml version="1.0" encoding="UTF-8"?>'
    print >>out, '<!DOCTYPE tmx SYSTEM "tmx14.dtd">'
    print >>out, roottxt
    print >>out, headertxt
    print >>out, '<body>'
    return out

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--maxsize', dest='maxsize', required=True, type=float, help='max size (in MB) of output files')
    parser.add_argument(dest='infile', help='.tmx file to be split')
    args = parser.parse_args()

    maxsize = args.maxsize * 1024 * 1024

    nodes = iter(iterparse(args.infile, events=['start','end']))

    _, root = next(nodes)
    _, header = next(nodes)

    roottxt = tostring(root).strip()
    headertxt = tostring(header).strip()

    i = 1
    curr_size = getsizeof(roottxt) + getsizeof(headertxt)
    out = startNewOutfile(args.infile, i, roottxt, headertxt)

    for event, node in nodes:
        if event == 'end' and node.tag == 'tu':
            nodetxt = tostring(node, encoding='utf-8').strip()
            curr_size += getsizeof(nodetxt)
            print >>out, nodetxt
        if curr_size > maxsize:
            curr_size = getsizeof(roottxt) + getsizeof(headertxt)
            print >>out, '</body>'
            print >>out, '</tmx>'
            out.close()
            i += 1
            out = startNewOutfile(args.infile, i, roottxt, headertxt)
        root.clear()

    print >>out, '</body>'
    print >>out, '</tmx>'
    out.close()

      



1 answer


Found the answer in a related question: "Why is elementtree.ElementTree.iterparse using so much memory?"

Each iteration of the for loop needs not only root.clear() but also node.clear(). Because we handle both start and end events, we must be careful not to clear a tu node too early, that is, before its end event, when its content has been fully parsed:



for e, node in nodes:
    if e == 'end' and node.tag == 'tu':
        nodetxt = tostring(node, encoding='utf-8').strip()
        curr_size += getsizeof(nodetxt)
        print >>out, nodetxt
        node.clear()
    if curr_size > maxsize:
        curr_size = getsizeof(roottxt) + getsizeof(headertxt)
        print >>out, '</body>'
        print >>out, '</tmx>'
        out.close()
        i += 1
        out = startNewOutfile(args.infile, i, roottxt, headertxt)
    root.clear()
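
For readers on Python 3, the same pattern can be sketched as a small generator. This is only an illustration under a few assumptions: the helper name `iter_units` and the inline sample document are hypothetical and not part of the original script, and `encoding="unicode"` is used so that `tostring` returns a `str`.

```python
import io
from xml.etree.ElementTree import iterparse, tostring


def iter_units(source, tag="tu"):
    """Yield each completed <tag> element as text, then free its contents.

    `iter_units` is a hypothetical helper name used for this sketch only.
    """
    events = iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    for event, node in events:
        # Only act on "end": at "start" the element's children are not parsed yet.
        if event == "end" and node.tag == tag:
            yield tostring(node, encoding="unicode").strip()
            node.clear()  # drop the unit's children, text and attributes
            root.clear()  # drop whatever is still attached directly to the root


if __name__ == "__main__":
    # Tiny stand-in for a real TMX file (assumed structure, for illustration).
    sample = io.BytesIO(
        b"<tmx version='1.4'><header/><body>"
        b"<tu><tuv><seg>a</seg></tuv></tu>"
        b"<tu><tuv><seg>b</seg></tuv></tu>"
        b"</body></tmx>"
    )
    for unit in iter_units(sample):
        print(unit)
```

Note that clear() empties an element but does not detach it from its parent, so the emptied tu elements still sit in body until parsing finishes; they are small, but they do add up on very large inputs. If the goal is to cap the size of each output file in bytes, `len(unit.encode("utf-8"))` measures the serialized size, whereas `sys.getsizeof` reports the size of the Python object in memory.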

      
