.NET regexes in an endless loop

Question

I am using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

It works 99% of the time, but sometimes when analyzing ...

Regex.IsMatch(HTML, Pattern)

The parser just blocks and it will continue on that line of code for a few minutes or indefinitely.

What's happening?

+1

c# regex vb.net visual-studio

InfoStatus 27 nov. '08 at 14:56

3 answers

With some effort you can get a regex to work on HTML, but have you looked at the HTML Agility Pack? It makes it easier to work with HTML as a DOM, with support for XPath queries and so on (i.e. "//div[@class='article']").
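A minimal sketch of that approach, assuming the third-party HtmlAgilityPack NuGet package is installed. The sample HTML string is hypothetical and only stands in for the page from the question:

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be referenced

class Program
{
    static void Main()
    {
        // Hypothetical input, standing in for the real page
        string html =
            "<html><head><title>Hello</title></head>" +
            "<body><div class=\"article\">Some text</div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath queries replace the two named groups of the original regex
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        HtmlNode article = doc.DocumentNode.SelectSingleNode("//div[@class='article']");

        // SelectSingleNode returns null when nothing matches,
        // so a missing element is cheap to detect
        Console.WriteLine(title?.InnerText ?? "(no title)");
        Console.WriteLine(article?.InnerText ?? "(no article)");
    }
}
```

Because the parser builds a tree first, a page without the expected div simply yields null instead of sending the regex engine into a long search.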

+3

Marc Gravell 27 nov. '08 at 15:08

You are asking your regex to do a lot. After each character, it has to look ahead to see if the next bit of text can be matched against the next part of the pattern.

Regex is a pattern-matching tool. While you can use it for simple parsing, you are better off using a dedicated parser (like the HTML Agility Pack, as Marc mentioned).

+1

David Kemp 27 nov. '08 at 15:10

Jan Goyvaerts · Accepted Answer · 2008-11-27T17:52:00+0000

Your regex will work fine when your HTML string actually contains HTML that fits the pattern. But when the HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking". Click that link and scroll down to the "Quickly Matching a Complete HTML File" section. It describes your problem exactly. [\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.
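As a rough sketch of how the call from the question could be guarded, the pattern below uses .+? with RegexOptions.Singleline in place of [\w\W]+? and passes a match timeout. The timeout overload only exists in .NET Framework 4.5 and later, and the two-second limit and sample input are assumptions made for illustration. This bounds the worst case but does not remove the backtracking itself:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Same pattern as in the question, with .+? under Singleline
        // instead of [\w\W]+?
        const string pattern =
            "<title>(?<Title>.+?)</title>.+?<div class=\"article\">(?<Text>.+?)</div>";

        // Hypothetical input that lacks the article div, so the match must fail
        string html = "<html><head><title>Hello</title></head><body>no article here</body></html>";

        try
        {
            // The timeout (an assumed value) caps how long the engine may backtrack
            bool found = Regex.IsMatch(
                html, pattern, RegexOptions.Singleline, TimeSpan.FromSeconds(2));
            Console.WriteLine(found ? "matched" : "no match");
        }
        catch (RegexMatchTimeoutException)
        {
            // Reached only if matching runs longer than the timeout
            Console.WriteLine("gave up: pattern took too long on this input");
        }
    }
}
```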
