Parsing a very large HTML file with Python (ElementTree?)
I tried to use ElementTree's event-driven parsing (iterparse). Testing it on a smaller file, settings.htm, worked fine:
>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))
This prints the items successfully. However, running the same code on 'messages.htm' instead of 'settings.htm', just to check that it works before writing any real code, gives this:
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6
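One way to see what the parser is objecting to is to catch the ParseError and look at the text around the reported position. The following is only a rough sketch: `show_parse_error` and its `context` parameter are hypothetical names made up for illustration, and it takes the document as a string rather than a file.

```python
import io
import xml.etree.ElementTree as ET


def show_parse_error(data, context=30):
    """Run iterparse over `data` (a str); return the text near the first error.

    Returns None if the input parses cleanly. `show_parse_error` and
    `context` are hypothetical names used only for this sketch.
    """
    try:
        for _event, _element in ET.iterparse(io.StringIO(data),
                                             events=("start", "end")):
            pass
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple; the line is 1-based.
        line, column = err.position
        lines = data.splitlines()
        # The error can be reported past the last line (e.g. truncated input).
        bad_line = lines[line - 1] if line <= len(lines) else ""
        start = max(column - context, 0)
        return bad_line[start:column + context]
    return None
```

Since the error here is reported at line 1, column 6, passing just the first few kilobytes of 'messages.htm' to such a function should be enough to see the offending text.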
I'm wondering if this is because ET is better suited for parsing XML documents? If so, and there is no workaround, I'll go back to the first one. Any suggestions on how to parse this file as well as how to debug along the way would be greatly appreciated!
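For reference, a possible direction if the problem really is that the file is HTML rather than well-formed XML: the standard library's html.parser.HTMLParser tolerates HTML that is not valid XML and can be fed in chunks, so a large file need not be read into memory at once. This is a minimal sketch, not a tested solution for this particular file; `TagEvents`, `parse_in_chunks`, `on_event` and `chunk_size` are hypothetical names.

```python
from html.parser import HTMLParser


class TagEvents(HTMLParser):
    """Report start/end tags through a callback, similar to iterparse events.

    `TagEvents` and `on_event` are hypothetical names for this sketch.
    """

    def __init__(self, on_event):
        super().__init__()
        self.on_event = on_event

    def handle_starttag(self, tag, attrs):
        self.on_event("start", tag)

    def handle_endtag(self, tag):
        self.on_event("end", tag)


def parse_in_chunks(fileobj, on_event, chunk_size=64 * 1024):
    """Feed a text-mode file object to the parser one chunk at a time."""
    parser = TagEvents(on_event)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        # HTMLParser buffers incomplete tags until the next feed() call.
        parser.feed(chunk)
    parser.close()
```

It could be used along the lines of `with open('messages.htm', encoding='utf-8') as f: parse_in_chunks(f, print)`, where the encoding is a guess about the file.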
2 answers