Why use DOM to parse web pages instead of regex?

I have searched for questions about finding content on a page, and many of the answers recommend using the DOM instead of regex when analyzing web pages. Why is this so? Does it improve processing time or something?

+3




3 answers


The DOM parser actually parses the page.

A regular expression only searches for text; it has no understanding of the structure or meaning of the HTML.

HTML is provably not a regular language; therefore, it is not possible to write a regular expression that matches every instance of an arbitrary element pattern in an HTML document without also matching some text that is not an instance of that pattern.
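To make the nesting problem concrete, here is a minimal sketch using only the Python standard library. The snippet, the `id` values, and the `OuterText` class are made up for illustration. The regex stops at the first `</div>` it sees, while a parser that tracks nesting depth reaches the matching one.

```python
import re
from html.parser import HTMLParser

HTML = '<div id="outer">a<div id="inner">b</div>c</div>'

# Non-greedy regex: stops at the FIRST closing </div>, which belongs to the inner div.
regex_match = re.search(r'<div id="outer">(.*?)</div>', HTML).group(1)


class OuterText(HTMLParser):
    """Collects all text inside <div id="outer">, including nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s deep we are inside the outer div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and (self.depth > 0 or dict(attrs).get("id") == "outer"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterText()
parser.feed(HTML)
parser.close()
parser_text = "".join(parser.chunks)

print(repr(regex_match))  # 'a<div id="inner">b'  (cut off early)
print(repr(parser_text))  # 'abc'
```

Switching the regex to a greedy match does not help in general: with two sibling `outer`-style blocks it would run past the end of the first one instead.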



You might be able to create a regex that works for your specific use case, but anticipating exactly the HTML you will be provided with (and thus how it will break your limited use regex) is extremely difficult.
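As a rough illustration of how a "works for my case" regex breaks on input you did not anticipate, consider link extraction. The markup below and the `LinkCollector` class are hypothetical. The sketch uses Python's built-in `html.parser`, which is an event-based tokenizer rather than a full DOM, but it is enough to show the difference.

```python
import re
from html.parser import HTMLParser

HTML = """
<p><a href="/real">real</a> and <A HREF='/also-real'>also real</A></p>
<!-- <a href="/commented-out">old</a> -->
"""

# A regex written for the "usual" form: lowercase tag, double-quoted href.
regex_links = re.findall(r'<a\s+[^>]*href="([^"]*)"', HTML)


class LinkCollector(HTMLParser):
    """Records the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; comments never reach this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML)
collector.close()
parser_links = collector.links

print(regex_links)   # ['/real', '/commented-out']
print(parser_links)  # ['/real', '/also-real']
```

The regex both misses a real link (uppercase tag, single quotes) and reports one that is commented out. Each case can be patched individually, but the list of cases keeps growing.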

Also, a regex is harder to adapt to changes in page content than an XPath expression, and XPath is (in my opinion) easier to read, as it doesn't have to deal with syntactic odds and ends like opening and closing tags.
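For a side-by-side feel of the readability point, here is a small sketch. It assumes a well-formed, XHTML-like snippet (invented for this example), because the standard library's `xml.etree.ElementTree` only accepts well-formed XML and supports only a subset of XPath. Real-world HTML would need an HTML-aware parser with fuller XPath support.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<html><body>
<table id="prices">
  <tr><td class="name">Apple</td><td class="price">1.20</td></tr>
  <tr><td class="name">Pear</td><td class="price">0.90</td></tr>
</table>
</body></html>"""

# XPath-style query: describes WHERE the data lives in the tree.
root = ET.fromstring(DOC)
xpath_prices = [
    td.text
    for td in root.findall(".//table[@id='prices']/tr/td[@class='price']")
]

# Regex: has to spell out the tag syntax, and says nothing about which table.
regex_prices = re.findall(r'<td class="price">([^<]*)</td>', DOC)

print(xpath_prices)  # ['1.20', '0.90']
print(regex_prices)  # ['1.20', '0.90']
```

Both give the same answer here, but if the page gains a second table with `price` cells, or the `td` gains another attribute, only the regex needs reworking.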

So, instead of using the wrong tool for the job (a text-matching tool on a structured document), use the right tool for the job (an HTML parser for parsing HTML).

+6




I'm tired of hearing that "HTML is not a regular language ...". Regular expressions (as implemented in modern languages) are not regular either.

Simple answer:

A regex is not a parser. It describes a pattern and matches that pattern, but it has no idea about the structure of the document. You cannot parse a document with a single regex. Of course, a regex can be part of a parser; I don't know for certain, but my guess is that almost every parser uses regexes internally to find certain sub-patterns.



If you can create such a pattern for the stuff you want to find inside the HTML, then use it. But very often you will not be able to, because it is almost impossible to cover all the corner cases or dependencies, such as finding all links, but only if they are green and not pink.

In most cases it is much easier to use a parser, which understands the structure of your document and also accepts a lot of "broken" HTML. This makes it easy to access all links, or all cells of a particular table, or ...
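Here is a minimal sketch of the "cells of a particular table" case on deliberately broken markup (unclosed `<td>` and `<tr>` tags). The markup, the `target` id, and the `TableCells` class are invented for illustration. Python's built-in `html.parser` does not build a DOM, but it does not choke on this input, and a few lines of state tracking are enough here.

```python
from html.parser import HTMLParser

BROKEN = """
<table id="other"><tr><td>ignore me</table>
<table id="target">
  <tr><td>Alice<td>30
  <tr><td>Bob<td>25
</table>
"""


class TableCells(HTMLParser):
    """Collects cell text, row by row, from the table with the given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = dict(attrs).get("id") == self.table_id
        elif self.in_table and tag == "tr":
            self.rows.append([])        # a new <tr> implicitly ends the old row
            self.in_cell = False
        elif self.in_table and tag == "td":
            if not self.rows:           # tolerate a <td> with no <tr> before it
                self.rows.append([])
            self.rows[-1].append("")    # a new <td> implicitly ends the old cell
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
            self.in_cell = False
        elif tag in ("td", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_table and self.in_cell:
            self.rows[-1][-1] += data


cells = TableCells("target")
cells.feed(BROKEN)
cells.close()
rows = [[cell.strip() for cell in row] for row in cells.rows]

print(rows)  # [['Alice', '30'], ['Bob', '25']]
```

A full DOM library would let you skip the manual state tracking and just ask for the table by id; the point is only that missing closing tags are not a problem for a parser.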

+1




In my opinion, it is safer to use REGEXP on pages where you have no control over the content: the HTML may be malformed and the DOM parser may fail.

Edit:
Well, considering what I just read, you should probably only use regexp if you need very small things, like getting all the links in a document, etc.

-1








