Why does this regex work?
OK. I don't fully understand why this regex works. The text I'm working on is this:
<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />
</body>
</html>
Using the following regex (tested in PHP, but I assume this is true for all Perl-compatible regexes), it will return all img tags that do not contain an alt attribute:
/<img(?:(?!alt=).)*?>/
Returns:
<img src="noalt" />
So, based on that, I would think that simply removing the non-capturing group would return the same:
/<img(?!alt=).*?>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />
<img src="withalt2" alt="blah" />
As you can see, it just returns all the image tags. Then, to make things even more confusing, removing the ? (just a wildcard as far as I know) after the * makes each match run up to the final >:
/<img(?!alt=).*>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />
So, does anyone want to let me know, or at least point me in the right direction, what's going on here?
/<img(?:(?!alt=).)*?>/
This regex applies a negative lookahead before every character that it matches after <img. So, as soon as it finds alt=, it stops. Thus, it will only match an img tag that has no alt attribute.
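A minimal sketch of this behaviour, run against the sample text from the question. It uses Python's re module rather than PHP, which is an assumption made for illustration; the pattern only uses lookahead syntax that both engines share. The names HTML, tempered and no_alt_tags are made up for the example.

```python
import re

# Sample text copied from the question.
HTML = """<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />
</body>
</html>"""

# (?!alt=) is re-checked before each character that "." consumes,
# so a match can never walk across "alt=" on its way to ">".
tempered = re.compile(r"<img(?:(?!alt=).)*?>")

no_alt_tags = tempered.findall(HTML)
print(no_alt_tags)  # ['<img src="noalt" />']
```

Note that "noalt" inside the src value does not block the match, because the lookahead tests for the literal text alt= and not for alt alone.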
/<img(?!alt=).*?>/
This regex applies the negative lookahead only once, directly after <img. This way, it will match everything up to the first > for any img tag in which <img is not immediately followed by alt=, regardless of whether alt= appears anywhere further down the line. Any later alt= is simply consumed by .*?.
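To make the single check visible, here is a small sketch, again in Python's re as an assumed stand-in for PHP. The variable names and the test string "<imgalt=x>" are hypothetical and only there to show the one case the lookahead actually rejects.

```python
import re

# Sample text copied from the question.
HTML = """<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />
</body>
</html>"""

# The lookahead runs once, at the position right after "<img".
# In the sample that position always holds a space, so the check always passes.
single_check = re.compile(r"<img(?!alt=).*?>")

all_tags = single_check.findall(HTML)
print(all_tags)  # all three img tags, each cut at its first ">"

# The pattern only rejects input where "alt=" is glued directly to "<img".
rejected = single_check.search("<imgalt=x>")
print(rejected)  # None
```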
/<img(?!alt=).*>/
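This is the same as the previous one, but each match runs to the last > it can reach, because .* is greedy. If . matched newlines (the s modifier), you would get a single match ending at the > of </html>. Since . does not match a newline by default, each match stops at the last > on its own line instead, which is the result you got.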
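The effect of newlines on the greedy version can be checked with a short sketch. It assumes Python's re, where re.DOTALL plays the role of the s modifier in PHP; the variable names are illustrative only.

```python
import re

# Sample text copied from the question.
HTML = """<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />
</body>
</html>"""

# Default mode: "." stops at a newline, so greedy ".*" backtracks
# to the last ">" on the same line as "<img".
per_line = re.findall(r"<img(?!alt=).*>", HTML)
for match in per_line:
    print(match)

# Dotall mode: "." also matches newlines, so one match runs from the
# first "<img" to the very last ">" in the text.
whole_text = re.findall(r"<img(?!alt=).*>", HTML, re.DOTALL)
print(len(whole_text))                     # 1
print(whole_text[0].endswith("</html>"))   # True
```

The default-mode output reproduces the three lines shown in the question, including the trailing <a href="asdf">asdf</a> on the second one.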
Now, forget everything above and use an HTML parser for parsing HTML. Parsers are designed specifically for this task. So, don't bother using regex, because regex cannot cope with every kind of HTML.
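As a rough illustration of the parser route, here is a sketch using Python's standard html.parser module. The answer does not name a particular parser, so this choice is an assumption; in PHP the equivalent would be a DOM-based approach. The class name ImgWithoutAlt and its sources attribute are hypothetical.

```python
from html.parser import HTMLParser

# Sample text copied from the question.
HTML = """<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />
</body>
</html>"""


class ImgWithoutAlt(HTMLParser):
    """Collects the src of every img tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.sources.append(attributes.get("src"))


parser = ImgWithoutAlt()
parser.feed(HTML)
parser.close()
print(parser.sources)  # ['noalt']
```

Self-closing tags such as <img ... /> reach handle_starttag through the default handle_startendtag, so no extra handler is needed here. Unlike the regex, this check looks at parsed attribute names, so attribute order, quoting style, or line breaks inside a tag do not affect it.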