Parsing a very large HTML file with Python (ElementTree?)

I tried to use BeautifulSoup to parse a very large (270 MB) HTML file, got a memory error, and was pointed to ElementTree as the solution.

I tried to use its event-driven parsing (iterparse). Testing it with a smaller file, settings.htm, worked fine:

>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
    print("%5s, %4s, %s" % (event, element.tag, element.text))

      

This prints the items successfully. However, when I run the same code on 'messages.htm' instead of 'settings.htm', just to check that it works before writing the actual code, this is the result:

Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6

      

Is this because ET is meant for parsing XML documents rather than HTML? If so, and there is no workaround, I'll go back to BeautifulSoup. Any suggestions on how to parse this file, and on how to debug along the way, would be greatly appreciated!

2 answers


A good solution for parsing HTML or XML is lxml and XPath.

Use XPath:



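from lxml import etree

# Read the raw bytes so lxml can detect the encoding itself,
# then parse them with its lenient HTML parser.
with open('result.html', 'rb') as f:
    doc = etree.HTML(f.read())

# Print the text of every cell in each matching table row.
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))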

      



HTML is not necessarily well-formed XML, so a strict XML parser such as ElementTree rejects it. That is why in some cases you use HTMLParser (html.parser in Python 3) instead of ElementTree to parse an HTML file.
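As a rough sketch of that approach, the standard-library HTMLParser can be fed a large file piece by piece, so the whole document never has to sit in memory at once. The class name TagCounter, the helper parse_in_chunks, and the chunk size are made up for illustration. What you actually collect from messages.htm depends on its structure, which is not shown here.

```python
from html.parser import HTMLParser
import io


class TagCounter(HTMLParser):
    """Hypothetical handler: counts start tags and collects text pieces."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        # One run of text may arrive in several calls when the input is
        # fed in chunks, so keep the pieces and join them later.
        self.text_parts.append(data)


def parse_in_chunks(stream, handler, chunk_size=64 * 1024):
    """Feed a text stream to the parser chunk by chunk."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        handler.feed(chunk)
    handler.close()
    return handler


# Not well-formed XML (unclosed <br> and <p>), yet html.parser accepts it.
sample = io.StringIO("<html><body><p>first<br><p>second</body></html>")
counter = parse_in_chunks(sample, TagCounter(), chunk_size=8)
print(counter.tag_counts)
print("".join(counter.text_parts))
```

For a real file, pass an open text file instead of the StringIO object, for example open('messages.htm', encoding='utf-8'). The encoding here is an assumption and should match the actual file.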



Best regards, Emmanuel
