Python HTML parsing fails on a document with inline JavaScript

Question

I am trying to use Python to parse HTML (although strictly speaking the server claims it is XHTML), and every parser I tried (ElementTree, minidom, and lxml) fails. When I look at where the problem is, it is inside a script tag:

<script type="text/javascript">
... // some javascript code
    if (condition1 && condition2) { // croaks on this line

I see what the problem is: the ampersand should be escaped. But it is inside a JavaScript script tag, so it cannot be escaped without breaking the code.

What's going on here? How can inline JavaScript break my parsing, and what can I do about it?

Update: By request, here is the code I used with lxml.

>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22

The lxml manual starts Chapter 9 by stating that "lxml provides a very simple and powerful API for parsing XML and HTML," so I would expect to see no exception.

+3

python html

Michael, Oct 8, 2014 at 22:18

1 answer

Jonathan Eunice · Accepted Answer · 2014-10-08T22:30:11+0000

There is a lot of really bad HTML out there to parse. Bad HTML is ubiquitous, and both script sections and various templating languages throw monkey wrenches into the works.

But you also seem to be using XML-oriented parsers for the job. These are stricter, and therefore much, much more likely to break when they are not given absolutely correct, fully valid input, which most HTML, including most XHTML, clearly is not.
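To make that strictness concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree, one of the parsers named in the question. The two markup strings and the try_parse helper are invented for illustration and are not the asker's actual page. The second string wraps the script body in a CDATA section, which is the usual way for well-formed XHTML to carry a raw ampersand.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the script block in the question.
BROKEN = """<html><body><script type="text/javascript">
if (a && b) { run(); }
</script></body></html>"""

# Same fragment, with the script body inside a CDATA section so that
# the raw '&' characters are legal for an XML parser.
WELL_FORMED = """<html><body><script type="text/javascript">//<![CDATA[
if (a && b) { run(); }
//]]></script></body></html>"""


def try_parse(markup):
    """Return (True, root element) on success, (False, error text) on failure."""
    try:
        return True, ET.fromstring(markup)
    except ET.ParseError as exc:
        return False, str(exc)


broken_ok, broken_result = try_parse(BROKEN)
fixed_ok, fixed_result = try_parse(WELL_FORMED)

print("unescaped script parsed:", broken_ok, "-", broken_result)
if fixed_ok:
    print("CDATA script parsed; script text is:")
    print(fixed_result.find("body/script").text)
```

The strict parser rejects the first string at the bare ampersand and accepts the second. This only helps if you control the page; for markup you cannot change, a lenient HTML parser is the practical route.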

So, use a parser designed to cope with imperfect HTML:

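import lxml.html

URL = "http://192.168.1.185/site.html"  # the page from the question
d = lxml.html.parse(URL)  # lenient HTML parser instead of the strict XML one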

This should get you off to the races.
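As a rough, offline illustration of the same idea, the sketch below feeds lxml.html a string instead of a URL. The PAGE markup is a made-up stand-in for the real page, and lxml.html.fromstring is used here only so the example needs no network access; it assumes lxml is installed.

```python
import lxml.html

# Hypothetical markup standing in for the page in the question;
# note the unescaped '&&' inside the script element.
PAGE = """<html><head><title>Demo</title>
<script type="text/javascript">
if (a && b) { run(); }
</script></head>
<body><p>Hello</p></body></html>"""

# The HTML parser treats script content as raw text, so '&&' is accepted.
root = lxml.html.fromstring(PAGE)

title = root.findtext(".//title")
paragraphs = [p.text_content() for p in root.iter("p")]
script_source = root.find(".//script").text

print(title)
print(paragraphs)
print(script_source)
```

Once the document is parsed, the usual element methods (find, findtext, iter) are available, and the JavaScript source comes back untouched as the text of the script element.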
