Iterative parsing of a large XML file without using the DOM approach
I have an XML file:
<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
.
.
<email id="998349883487454359203" Body="hi"/>
</temp>
I want to read the XML file one email tag at a time: read the email with id="1" and extract its Body, then read id="2" and extract its Body, and so on.
I first tried parsing the XML with the DOM approach, but since my file is 100 GB that doesn't work. Then I tried:
from xml.etree import ElementTree as ET

tree = ET.parse('myfile.xml')
root = tree.getroot()
for i in root.findall('email'):
    print i.get('Body')
This gets me the root, but I don't understand why the parse still fails on the full file.
The code throws the following error when using iterparse:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"
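That error means the euro sign (u'\u20ac') is being encoded with the ASCII codec; in Python 2, print does this implicitly when the output stream's encoding is ASCII. A minimal Python 3 sketch of the same failure and an explicit-encoding workaround (the sample string is illustrative, not from the file):

```python
# Reproduce the failure: encoding the euro sign (U+20AC) with the
# ASCII codec raises the same UnicodeEncodeError.
text = u'total: \u20ac100'

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii cannot encode:', exc.reason)

# Workaround: encode explicitly to UTF-8 before writing to an
# ASCII-only stream.
encoded = text.encode('utf-8')
print(encoded)  # b'total: \xe2\x82\xac100'
```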
Can anyone help?
1 answer
Example for iterparse:

import cStringIO
from xml.etree.ElementTree import iterparse

fakefile = cStringIO.StringIO("""<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
<email id="998349883487454359203" Body="hi"/>
</temp>
""")

for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['Body']
    elem.clear()  # free each element once processed, keeping memory usage flat
Just replace fakefile with your real file. Also read this one for more details.
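The same pattern against a file on disk might look like this in Python 3 (the small sample file below stands in for the real 100 GB one; the name myfile.xml is illustrative):

```python
from xml.etree.ElementTree import iterparse

# Write a tiny stand-in file so the sketch is self-contained;
# with the real 100 GB file you would skip this step.
with open('myfile.xml', 'w') as f:
    f.write('<temp>\n'
            '<email id="1" Body="abc"/>\n'
            '<email id="2" Body="fre"/>\n'
            '</temp>\n')

bodies = []
# iterparse streams the document and yields each element as its end
# tag is read, so the full tree is never held in memory at once.
for _, elem in iterparse('myfile.xml'):
    if elem.tag == 'email':
        bodies.append((elem.attrib['id'], elem.attrib['Body']))
    elem.clear()  # discard processed content to keep memory flat

print(bodies)  # [('1', 'abc'), ('2', 'fre')]
```

For a very long run you may also want to periodically clear the root element, since cleared children still leave references behind on their parent.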