Why does this regex work?

OK. I don't fully understand why this regex works. The text I'm working on is this:

<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>

      

Using the following regex (tested in PHP, but I assume this is true for all Perl-compatible regexes), it returns all img tags that do not contain an alt attribute:

/<img(?:(?!alt=).)*?>/
Returns:
<img src="noalt" />

      

So, based on that, I would think that simply removing the non-capturing group (the grouping without a backreference) would return the same:

/<img(?!alt=).*?>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />
<img src="withalt2" alt="blah" />

      

As you can see, it just returns all the image tags. Then, to make things even more confusing, removing the ? (just a wildcard as far as I know) after the * makes each match run to the final > on the line:

/<img(?!alt=).*>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />

      

So, does anyone want to let me know, or at least point me in the right direction, what's going on here?

1 answer


/<img(?:(?!alt=).)*?>/

      

This regex applies a negative lookahead before every character that it matches after <img. So, as soon as it reaches alt=, it stops, and because the next character is not >, the match fails for that tag. Thus, it matches only an img tag that has no alt attribute.
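To see the per-character lookahead in action, here is a minimal sketch. The question tested in PHP; this uses Python's re module instead, on the assumption that it treats this particular pattern the same way. The names HTML and TEMPERED are only illustrative.

```python
import re

# Sample document from the question.
HTML = """<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>
"""

# (?:(?!alt=).)*? consumes one character at a time; before each
# character, the lookahead checks that "alt=" does not start there.
TEMPERED = re.compile(r'<img(?:(?!alt=).)*?>')

matches = TEMPERED.findall(HTML)
print(matches)
```

Note that src="withalt" does not trip the lookahead, because the text there is alt" and not alt=.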

/<img(?!alt=).*?>/

      

This regex applies the negative lookahead only once, immediately after <img. So it matches everything up to the first > for any img tag in which <img is not directly followed by alt=, regardless of whether alt= appears anywhere further along in the tag. Since <img is normally followed by a space, the lookahead never fails here, and any later alt= is simply consumed by .*?.
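A small sketch of the single lookahead, again in Python's re module as a stand-in for PHP. The string <imgalt="x"> is a contrived, hypothetical input, used only to show the one case in which this lookahead would reject a match.

```python
import re

# Sample document from the question.
HTML = """<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>
"""

# The lookahead is checked exactly once, at the position right after "<img".
LOOKAHEAD_ONCE = re.compile(r'<img(?!alt=).*?>')

all_tags = LOOKAHEAD_ONCE.findall(HTML)
for tag in all_tags:
    print(tag)

# Contrived input: "alt=" starts directly after "<img", so the match fails.
rejected = LOOKAHEAD_ONCE.search('<imgalt="x">')
print(rejected)
```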



/<img(?!alt=).*>/

      

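This is the same as the previous one, but it matches up to the last > it can reach, because it uses greedy matching. Without the s modifier, . does not match a newline, so each match ends at the last > on the same line, which is the result you got. With the s modifier you would get a single match running up to the > of </html>.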
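The per-line result in the question is consistent with . not matching a newline by default. A minimal sketch comparing both behaviours, using Python's re module as a stand-in for PHP; re.DOTALL here plays the role of the s modifier.

```python
import re

# Sample document from the question.
HTML = """<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>
"""

# Greedy .* stops at the end of each line because "." skips newlines.
GREEDY = re.compile(r'<img(?!alt=).*>')
per_line = GREEDY.findall(HTML)
for match in per_line:
    print(match)

# With DOTALL, "." also matches newlines, so .* runs to the last ">".
GREEDY_DOTALL = re.compile(r'<img(?!alt=).*>', re.DOTALL)
whole = GREEDY_DOTALL.findall(HTML)
print(len(whole))
```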


Now, forget everything above and use an HTML parser for parsing HTML. Parsers are designed specifically for this task. Don't bother with regex here, because regex cannot parse every kind of HTML.
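As one possible illustration of the parser approach, here is a sketch using Python's standard html.parser module (in PHP you would reach for a DOM parser instead). The class name ImgWithoutAlt and its sources attribute are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Sample document from the question.
HTML = """<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>
"""

class ImgWithoutAlt(HTMLParser):
    """Collects the src of every img tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags
        # such as <img ... /> are routed here by default as well.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.sources.append(attributes.get("src"))

parser = ImgWithoutAlt()
parser.feed(HTML)
parser.close()
print(parser.sources)
```

The check is on the parsed attribute names, so attribute order, quoting style, and line breaks inside the tag no longer matter.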
