.NET regexes in an endless loop

I am using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

      

It works 99% of the time, but sometimes when analyzing ...

Regex.IsMatch(HTML, Pattern)

      

The parser just hangs on that line of code, for a few minutes or indefinitely.

What's happening?

+1




3 answers


Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when the HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking". The article of that name has a section, "Quickly Matching a Complete HTML File", that describes your problem exactly. Also, [\w\W]+? is a roundabout way of writing .+? with RegexOptions.Singleline.
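As a rough sketch of how to keep a non-matching document from hanging the caller, the pattern from the question can be run with a match timeout (available since .NET Framework 4.5). The class name ArticleMatcher, the method name TryIsMatch and the choice of timeout are assumptions made for illustration; the pattern is the original one with [\w\W] written as . under RegexOptions.Singleline. The timeout does not cure the backtracking, it only bounds how long it can run.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question.
public static class ArticleMatcher
{
    // Same pattern as in the question; "." stands in for [\w\W]
    // because RegexOptions.Singleline lets "." match newlines too.
    private const string Pattern =
        @"<title>(?<Title>.+?)</title>.+?<div class=""article"">(?<Text>.+?)</div>";

    // Returns true on a match, false on no match, and null when the
    // engine was still backtracking after the given timeout.
    public static bool? TryIsMatch(string html, TimeSpan timeout)
    {
        var regex = new Regex(Pattern, RegexOptions.Singleline, timeout);
        try
        {
            return regex.IsMatch(html);
        }
        catch (RegexMatchTimeoutException)
        {
            // The input did not match quickly; give up instead of blocking.
            return null;
        }
    }
}
```

Calling ArticleMatcher.TryIsMatch(html, TimeSpan.FromSeconds(2)) in place of Regex.IsMatch(HTML, Pattern) lets the caller log or skip a problem document instead of blocking on it.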



+6




With some effort you can get a regex to work on HTML, but have you looked at the HTML Agility Pack? It makes it much easier to work with HTML as a DOM, with support for XPath queries and so on (e.g. "//div[@class='article']").
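A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced (it is a third-party library, not part of .NET). The class name ArticleExtractor and the method name Extract are made up for this example; the XPath for the article is the one given above, and "//title" is an assumed counterpart for the title.

```csharp
using HtmlAgilityPack; // third-party NuGet package

// Hypothetical helper, not part of the original answer.
public static class ArticleExtractor
{
    // Returns the title and article text; either value is null when
    // the corresponding element is not present in the document.
    public static (string Title, string Text) Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        HtmlNode article = doc.DocumentNode.SelectSingleNode("//div[@class='article']");

        return (title?.InnerText, article?.InnerText);
    }
}
```

A missing element simply yields null here, so a malformed page is a quick lookup failure and not a long backtracking search.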



+3




You are asking your regex to do a lot. After each character, it has to look ahead to see if the next bit of text can be matched against the next part of the pattern.
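One way to ask less of the regex, sketched here as an assumption and not something proposed in the answers, is to split the pattern into two smaller ones so that each has only a single lazy quantifier and a missing tag fails after one scan. StepwiseMatcher and TryExtract are hypothetical names. The behaviour is only roughly equivalent to the original pattern: it does not require any text between the title and the article, and it only considers the first title element.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class StepwiseMatcher
{
    private static readonly Regex TitleRegex =
        new Regex(@"<title>(?<Title>.+?)</title>", RegexOptions.Singleline);

    private static readonly Regex ArticleRegex =
        new Regex(@"<div class=""article"">(?<Text>.+?)</div>", RegexOptions.Singleline);

    public static bool TryExtract(string html, out string title, out string text)
    {
        title = null;
        text = null;

        // Step 1: find the title on its own.
        Match titleMatch = TitleRegex.Match(html);
        if (!titleMatch.Success) return false;

        // Step 2: look for the article only after the end of the title.
        Match articleMatch = ArticleRegex.Match(html, titleMatch.Index + titleMatch.Length);
        if (!articleMatch.Success) return false;

        title = titleMatch.Groups["Title"].Value;
        text = articleMatch.Groups["Text"].Value;
        return true;
    }
}
```

Because the second search starts where the first one ended, a failure in either step is reported immediately without retrying every combination of the two.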

Regex is a pattern-matching tool. While you can use it for simple parsing, you are better off using a dedicated parser (such as the HTML Agility Pack mentioned in another answer).

+1








