Using BeautifulSoup on very large HTML file - memory error?

I am learning Python while working on a Facebook message parser project. I downloaded my Facebook data, including the messages.htm file containing all of my messages, and I am trying to write a program to parse this file and output statistics (number of messages, most common words, etc.).

However, my messages.htm file is 270 MB. In a test wrapper, creating a BeautifulSoup object from any other file (all under 1 MB) works just fine, but I cannot create one from messages.htm. Here's the error:

>>> mf = open('messages.htm', encoding="utf8")
>>> ms = bs4.BeautifulSoup(mf)
Traceback (most recent call last):
  File "<pyshell#73>", line 1, in <module>
    ms = bs4.BeautifulSoup(mf)
  File "C:\Program Files (x86)\Python\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
  File "C:\Program Files (x86)\Python\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError


So I can't even get started with this file. This is my first time doing something like this and I'm just learning Python so any suggestions would be much appreciated!



1 answer


Since you're using this as a learning exercise, I won't give too much code. You might be better off with ElementTree's iterparse, which lets you process elements as they are parsed instead of loading the whole file into memory first. As far as I know, BeautifulSoup doesn't have an equivalent feature.

To get started:



import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

with open('messages.htm', encoding="utf8") as source:

    # get an iterable of (event, element) pairs
    context = ET.iterparse(source, events=("start", "end"))

    # grab the root element from the first event so we can clear it later
    event, root = next(context)

    for event, elem in context:
        if event == "end":
            # do something with elem

            # discard processed elements to keep memory use low
            root.clear()


If you are set on using BeautifulSoup, you can look into breaking the original HTML into manageable chunks, but you need to be careful to preserve the structure of the message flow and ensure that each chunk remains valid HTML.
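A minimal sketch of that chunking idea. It assumes each message begins with a `<div class="message">` marker (the delimiter, the block size, and the layout are all assumptions here; inspect your own messages.htm and adjust the delimiter to match):

```python
def iter_chunks(path, delimiter='<div class="message">', encoding="utf8"):
    """Yield pieces of the file that each begin with `delimiter`.

    The file is read in fixed-size blocks, so the whole 270 MB never
    sits in memory at once. The delimiter is an assumption about the
    export's layout, not a guarantee.
    """
    buf = ""
    with open(path, encoding=encoding) as f:
        for block in iter(lambda: f.read(64 * 1024), ""):
            buf += block
            while True:
                start = buf.find(delimiter)
                nxt = buf.find(delimiter, start + 1) if start != -1 else -1
                if nxt == -1:
                    break
                yield buf[start:nxt]
                buf = buf[nxt:]
    # whatever follows the last delimiter is the final chunk
    if delimiter in buf:
        yield buf[buf.find(delimiter):]

# Each chunk is small enough for BeautifulSoup to parse comfortably:
# from bs4 import BeautifulSoup
# for chunk in iter_chunks('messages.htm'):
#     soup = BeautifulSoup(chunk, "html.parser")
#     ...  # count messages, words, etc.
```

The trailing HTML (`</body></html>`) ends up attached to the last chunk; `html.parser` is forgiving about that, but keep it in mind if you validate the chunks.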
