Why does this regex return errors when I use it to find img src from HTML?
I am writing a function that outputs the src from the first image tag it finds in the HTML file. Following the instructions in this thread, I got something that seemed to work:
preg_match_all('#<img[^>]*>#i', $content, $match);
foreach ($match as $value) {
$img = $value[0];
}
$stuff = simplexml_load_string($img);
$stuff = $stuff[src];
return $stuff;
But a few minutes after using this function, it started returning errors like this:
warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Premature end of data in tag img line 1 in path/to/script on line 42.
and
warning: simplexml_load_string() [function.simplexml-load-string]: tp://feeds.feedburner.com/~f/ChicagobusinesscomBreakingNews?i=KiStN" border="0"> in path/to/script on line 42.
I'm kind of new to PHP, but it looks like my regex is capturing the HTML incorrectly. How can I make it more "airtight"?
These two lines of PHP code should give you a list of all the src attribute values in all img tags in the HTML file:
preg_match_all('/<img\s+[^<>]*src=["\']?([^"\'<>\s]+)["\']?/i', $content, $result, PREG_PATTERN_ORDER);
$result = $result[1];
To keep the regex simple, I don't allow spaces in file names. If you want to allow them, you need to use separate alternatives for quoted attribute values (which can contain spaces) and unquoted values (which cannot).
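As a rough sketch of what those separate alternatives could look like: the pattern below tries a double-quoted value, then a single-quoted value, then an unquoted one. The function name extract_img_srcs and the sample HTML are made up for illustration, and the `(?| ... )` branch-reset group is just one way to have all three alternatives land in capture group 1.

```php
<?php
// Hypothetical helper: returns the src values of all <img> tags in $html.
function extract_img_srcs(string $html): array
{
    // Alternatives, in order: "double-quoted", 'single-quoted', unquoted.
    // Quoted values may contain spaces; the unquoted one may not.
    // (?| ... ) is a branch-reset group, so each alternative fills group 1.
    $pattern = '/<img\b[^<>]*?\ssrc\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^"\'<>\s]+))/i';

    preg_match_all($pattern, $html, $result, PREG_PATTERN_ORDER);
    return $result[1];
}

// Made-up sample input covering all three quoting styles.
$content = '<p><img class="a" src="my photo.png" alt="x">'
         . '<IMG src=plain.gif><img alt="y" src=\'it.jpg\'></p>';

print_r(extract_img_srcs($content));
```

This is still a regex over HTML, so unusual markup (for example a ">" inside an earlier attribute value) can defeat it.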
Most likely because the "XML" captured by the regex is not valid XML for whatever reason. I would probably go for a more complex regex that pulls out the src attribute directly instead of using SimpleXML to get it. This regex might be close to what you need:
<img[^>]*\ssrc\s*=\s*["']?([^"'\s>]+)["']?[^>]*>
You could also use a real HTML parsing library, but I'm not sure what options exist in PHP.
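One option that ships with PHP is the DOM extension, whose DOMDocument::loadHTML() tolerates unclosed tags and bare ampersands. The following is a minimal sketch, not a drop-in replacement; the function name first_img_src and the sample markup are invented for the example.

```php
<?php
// Hypothetical helper: returns the src of the first <img> that has one, or null.
function first_img_src(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            return $img->getAttribute('src');
        }
    }
    return null; // no <img> with a src attribute
}

// Made-up input: unclosed <img> tag with an unencoded "&" in the URL.
$content = '<p>hi</p><img src="http://example.com/a.png?x=1&y=2" border="0"><img src="b.png">';
var_dump(first_img_src($content));
```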
A bare ampersand in an attribute is invalid XML (it must be encoded as "&amp;"), but some people still write it unencoded in URLs in HTML pages (and all browsers support it). Maybe that is your problem.
If so, you can sanitize your string before parsing it by replacing "&(?!amp;)" with "&amp;".
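In PHP, that replacement could be done with preg_replace(), roughly as below. The sample tag is made up and is deliberately self-closed; a tag that lacks the closing "/" would still fail to parse as XML regardless of the ampersands.

```php
<?php
// Made-up <img> tag with bare ampersands in the URL (and one already encoded).
$img = '<img src="http://example.com/feed?a=1&i=KiStN&amp;b=2" border="0"/>';

// Encode every "&" that is not already the start of "&amp;".
$clean = preg_replace('/&(?!amp;)/', '&amp;', $img);

// simplexml_load_string() returns false when the input is not well-formed XML.
$xml = simplexml_load_string($clean);
$src = $xml === false ? null : (string) $xml['src'];

var_dump($src);
```

Note that this lookahead only recognizes "&amp;", so other entities such as "&lt;" would end up double-encoded.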
On another note:
foreach ($match as $value) {
$img = $value[0];
}
can be replaced with
$img = $match[count($match) - 1][0];
Something like this:
if (preg_match('#<img\s[^>]*>#i', $content, $match)) {
    $img = $match[0]; // first image in the file only
    $stuff = simplexml_load_string($img); // false if the tag is not well-formed XML
    return $stuff === false ? null : (string) $stuff['src'];
} else {
    return null; // no match found
}