Parsing HTML using pugixml or an actual HTML parser

I am interested in using pugixml to parse HTML documents, but HTML allows some closing tags to be omitted. Here's an example: <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

pugixml stops parsing as soon as it encounters a tag that is not closed, but in HTML a missing end tag does not necessarily mean that the tags are mismatched.

Even a simple attempt to parse pugixml's own HTML documentation fails, because an unclosed meta tag appears right at the top of the document: http://pugixml.googlecode.com/svn/tags/latest/docs/quickstart.html

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<title>pugixml 1.0</title>
<link rel="stylesheet" href="pugixml.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
<link rel="home" href="quickstart.html" title="pugixml 1.0">
</head>
<!--- etc... -->

      

Many HTML documents in the wild will fail if I try to parse them with pugixml. Is there a way to avoid this? If there is no way to fix it, is there another HTML parsing tool that is about as fast as pugixml?

Update

It would be great if the HTML parser also supported XPath.



3 answers


I ended up taking pugixml, turning it into an HTML parser, and creating a GitHub project for it: https://github.com/rofldev/pugihtml



It doesn't fully comply with the HTML spec yet, but it parses HTML well enough for my purposes. I am working on getting it to conform to the spec.



One way to solve this problem is to preprocess the HTML into XHTML. The result is then well-formed XML and can be handled with XML tools such as pugixml. If you'd like to go this route, see this question: How to convert HTML to XHTML?
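
To give a rough idea of what such preprocessing involves, here is a minimal sketch in plain C++ (no pugixml needed) that only handles the case from the question: void elements such as meta and link that are written without a closing slash. The function name self_close_void_elements is made up for this illustration and is not part of pugixml or any other library. It assumes that a literal '<' in text is already escaped as &lt;, and it does not treat script or style contents specially.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// HTML void elements: they never take an end tag.
static const std::set<std::string> kVoidElements = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr"};

// Hypothetical helper: rewrites <meta ...> as <meta ... /> so that a strict
// XML parser sees a closed element. Comments, doctypes, end tags and all
// other tags are copied through unchanged.
std::string self_close_void_elements(const std::string& html) {
    std::string out;
    out.reserve(html.size() + 16);
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') { out += html[i++]; continue; }

        // Copy comments verbatim so a '>' inside them is not misread.
        if (html.compare(i, 4, "<!--") == 0) {
            std::size_t end = html.find("-->", i + 4);
            end = (end == std::string::npos) ? html.size() : end + 3;
            out.append(html, i, end - i);
            i = end;
            continue;
        }

        // Read the (lowercased) tag name that follows '<'.
        std::size_t j = i + 1;
        std::string name;
        while (j < html.size() &&
               std::isalnum(static_cast<unsigned char>(html[j]))) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(html[j])));
            ++j;
        }

        // Find the closing '>' while skipping quoted attribute values.
        char quote = '\0';
        while (j < html.size()) {
            const char c = html[j];
            if (quote != '\0') { if (c == quote) quote = '\0'; }
            else if (c == '"' || c == '\'') quote = c;
            else if (c == '>') break;
            ++j;
        }
        if (j >= html.size()) {  // unterminated tag: copy the rest as-is
            out.append(html, i, std::string::npos);
            break;
        }

        out.append(html, i, j - i);  // everything up to, not including, '>'
        const bool already_closed = html[j - 1] == '/';
        if (kVoidElements.count(name) != 0 && !already_closed) out += " /";
        out += '>';
        i = j + 1;
    }
    return out;
}
```

Run over the head section quoted in the question, this should turn each unclosed meta and link tag into its self-closed form and leave title untouched, after which the text could be handed to an XML parser. It does nothing for end tags that are truly optional (for example a missing </p> or </li>), so real-world pages would still need a proper HTML-to-XHTML converter.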




