Why does this regex work?

Question

OK. I don't fully understand why this regex works. The text I'm working on is this:

<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>

Using the following regex (tested in PHP, but I assume this is true for all Perl-style regexes), it returns all img tags that do not contain an alt attribute:

/<img(?:(?!alt=).)*?>/
Returns:
<img src="noalt" />

So, based on that, I would think that simply removing the grouping (the one without a backreference) would return the same:

/<img(?!alt=).*?>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />
<img src="withalt2" alt="blah" />

As you can see, it just returns all the image tags. Then, to make things even more confusing, removing the ? (just a wildcard as far as I know) after the * makes it match up to the final >:

/<img(?!alt=).*>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />

So, does anyone want to let me know, or at least point me in the right direction, what's going on here?

regex

Eric Feb 14 13 at 21:42

1 answer

Rohit Jain · Accepted Answer · 2013-02-14T21:56:09+0000

/<img(?:(?!alt=).)*?>/

This regex applies a negative lookahead before every character that it matches after <img. So, as soon as it reaches alt=, it cannot go any further, and the match fails because the closing > has not been reached yet. Thus, it will match only an img tag that has no alt attribute.
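To see the per-character lookahead in action, here is a minimal sketch. It assumes Python's re module rather than PHP's preg functions (the lookahead syntax is the same), and the variable names are only illustrative.

```python
import re

# Sample text from the question.
html = '''<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>'''

# (?!alt=) is checked before each single character consumed by the dot,
# so the match can never run past a position where "alt=" begins.
tempered = re.compile(r'<img(?:(?!alt=).)*?>')

matches = tempered.findall(html)
print(matches)  # ['<img src="noalt" />']
```

Note that the src value "noalt" is not a problem here: the lookahead only rejects the literal text alt=, and "noalt" is followed by a quote, not an equals sign.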

/<img(?!alt=).*?>/

This regex applies the negative lookahead only once, immediately after <img. So it matches everything up to the first > for any img tag in which <img is not directly followed by alt=, regardless of whether alt= appears anywhere further along. Any later alt= is simply consumed by .*?
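A short sketch of the single-lookahead version, again assuming Python's re module as a stand-in for PHP. The string '<imgalt=x>' is a made-up input, included only to show the one case this lookahead actually rejects.

```python
import re

# Sample text from the question.
html = '''<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>'''

# The lookahead runs once, right after "<img". In every tag above, the next
# characters are ' src', so the check passes and .*? runs to the first ">".
single = re.compile(r'<img(?!alt=).*?>')

matches = single.findall(html)
for m in matches:
    print(m)

# Only a tag where "alt=" follows "<img" with nothing in between is rejected.
rejected = single.search('<imgalt=x>')
print(rejected)  # None
```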

/<img(?!alt=).*>/

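This is the same as the previous one, but it matches up to the last > because it uses greedy matching. But I don't know why you got this result, unless . is not matching newlines (the default without the s modifier), in which case each match stops at the last > on its own line. Otherwise, you should get everything up to the very last >, the one in </html>.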
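The difference between the two outcomes can be checked with a small sketch. It assumes Python's re module, where the re.DOTALL flag plays the role of the s modifier; the names per_line and across_lines are only illustrative.

```python
import re

# Sample text from the question.
html = '''<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>'''

pattern = r'<img(?!alt=).*>'

# Default mode: the dot does not match a newline, so each greedy match
# ends at the last ">" on the same line as its "<img".
per_line = re.findall(pattern, html)
for m in per_line:
    print(m)

# With DOTALL the dot also matches newlines, so a single greedy match runs
# from the first "<img" to the last ">" in the text, the one in </html>.
across_lines = re.findall(pattern, html, flags=re.DOTALL)
print(len(across_lines))  # 1
```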

Now, forget everything above and move towards an HTML parser for parsing HTML. Parsers are designed specifically for this task. So don't bother using regex, because you cannot parse every kind of HTML with regex.
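As a rough illustration of the parser route, here is a sketch using HTMLParser from Python's standard library. The answer does not name a particular parser, so this choice, the class name ImgWithoutAlt, and the missing list are all assumptions made for the example; in PHP you would reach for a DOM library instead.

```python
from html.parser import HTMLParser

# Sample text from the question.
html = '''<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>'''


class ImgWithoutAlt(HTMLParser):
    """Collects the src of every img tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <img ... /> also end up here via the default handle_startendtag.
        attributes = dict(attrs)
        if tag == 'img' and 'alt' not in attributes:
            self.missing.append(attributes.get('src'))


finder = ImgWithoutAlt()
finder.feed(html)
print(finder.missing)  # ['noalt']
```

Because the parser works on attributes rather than raw text, it does not care about attribute order, extra whitespace, or a > inside a quoted value, all of which can trip up the regex versions.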
