Why does this regex return errors when I use it to find img src from HTML?

I am writing a function that returns the src of the first image tag it finds in an HTML file. Following the instructions in this thread, I came up with something that seemed to work:

preg_match_all('#<img[^>]*>#i', $content, $match); 

foreach ($match as $value) {
    $img = $value[0];
}

$stuff = simplexml_load_string($img);
$stuff = $stuff['src'];
return $stuff;

      

But a few minutes after using this function, it started returning errors like this:

warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Premature end of data in tag img line 1 in path/to/script on line 42.

and

warning: simplexml_load_string() [function.simplexml-load-string]: tp://feeds.feedburner.com/~f/ChicagobusinesscomBreakingNews?i=KiStN" border="0"> in path/to/script on line 42.

I'm fairly new to PHP, but it looks like my regex is capturing the HTML incorrectly. How can I make it more "airtight"?

+1




4 answers


These two lines of PHP code should give you a list of all the src attribute values in all img tags in the HTML file:

preg_match_all('/<img\s+[^<>]*src=["\']?([^"\'<>\s]+)["\']?/i', $content, $result, PREG_PATTERN_ORDER);
$result = $result[1];

      



To keep the regex simple, I don't allow spaces in file names. If you want to allow them, you need to use separate alternatives for quoted attribute values (which can contain spaces) and unquoted values (which cannot).
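As a rough sketch of what "separate alternatives" could look like, the pattern below tries a double-quoted value, then a single-quoted value, then an unquoted one. The function name `extract_img_srcs` is made up for illustration, and the pattern is an assumption rather than part of the answer above; it uses PCRE's branch-reset group `(?|...)` so that whichever alternative matches ends up in capture group 1.

```php
<?php
// Hypothetical helper (not from the original answer): collect the src
// values of all <img> tags, allowing spaces inside quoted values.
function extract_img_srcs(string $content): array
{
    // (?| ... ) is a branch-reset group: each alternative numbers its
    // capture group from 1, so the value is always in group 1.
    $pattern = '/<img\b[^<>]*?\ssrc\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^"\'<>\s]+))/i';

    // PREG_PATTERN_ORDER puts all group-1 captures in $result[1].
    preg_match_all($pattern, $content, $result, PREG_PATTERN_ORDER);

    return $result[1];
}
```

This is still a regex over HTML, so it can be fooled, for example by another attribute whose value happens to contain the text " src=".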

+2




Most likely because the "XML" captured by the regex is not valid XML for some reason. I would probably go for a more complex regex that pulls out the src attribute directly instead of using SimpleXML to get it. This regex might be close to what you need.

<img[^>]*src\s*=\s*['"]?([^'">\s]*)['"]?[^>]*>

      



You could also use a real HTML parsing library, but I'm not sure what options exist in PHP.
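One option is PHP's bundled DOM extension (usually enabled by default), whose `DOMDocument::loadHTML()` tolerates the kind of sloppy markup that trips up SimpleXML. The following is a minimal sketch, assuming the whole document is available in a string; the function name `first_img_src` is hypothetical.

```php
<?php
// Hypothetical helper: return the src of the first <img> that has one,
// or null if the document contains no such tag.
function first_img_src(string $content): ?string
{
    if ($content === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($content);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            return $img->getAttribute('src');
        }
    }

    return null;
}
```

Because the parser handles quoting and unclosed tags itself, no regex is needed to cut the tag out of the page first.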

0




A bare ampersand in an attribute is invalid XML (it must be encoded as "&amp;"), but some people still write it that way in URLs in HTML pages (and all browsers accept it). That may be your problem.

If so, you can sanitize your string before parsing it by replacing "&(?!amp;)" with "&amp;".
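In PHP, that replacement could be done with `preg_replace()`. This is only a sketch of the substitution described above; the function name `escape_bare_ampersands` is invented for the example.

```php
<?php
// Hypothetical helper: encode every "&" that is not already the start
// of "&amp;" so the tag has a better chance of parsing as XML.
function escape_bare_ampersands(string $tag): string
{
    // The negative lookahead (?!amp;) skips ampersands that are
    // already encoded. Fall back to the input if preg_replace fails.
    return preg_replace('/&(?!amp;)/', '&amp;', $tag) ?? $tag;
}
```

Note that this simple lookahead also rewrites other entities, so "&lt;" would become "&amp;lt;". If the input may contain such entities, the lookahead would need to be widened.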

0




On a separate note:

foreach ($match as $value) {
    $img = $value[0];
}

      

can be replaced with

$img = $match[count($match) - 1][0];
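To see why the two forms are equivalent here, it may help to look at what `preg_match_all()` puts in `$match` when the pattern has no capture groups. The sample HTML string below is made up for the demonstration.

```php
<?php
// Sample input, invented for this demonstration.
$content = '<p><img src="a.png"><img src="b.png"></p>';

preg_match_all('#<img[^>]*>#i', $content, $match);

// With the default PREG_PATTERN_ORDER and no capture groups, $match
// holds a single entry: $match[0], the list of full-pattern matches.
$viaLoop = null;
foreach ($match as $value) {
    $viaLoop = $value[0]; // runs once, picks the first full match
}

// The last (and only) entry of $match, first match inside it.
$viaIndex = $match[count($match) - 1][0];
```

Both variables end up holding the first img tag in the string.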

      

Or use something like this:

if (preg_match('#<img\s[^>]*>#i', $content, $match)) {
    $img = $match[0]; //first image in file only
    $stuff = simplexml_load_string($img);
    $stuff = $stuff['src'];
    return $stuff;
} else {
    return null; //no match found
}

      

0








