PHP scraper with cURL - how can I debug it?

I just found out about scraping and cURL a few hours ago and I've been playing around with them ever since. However, I'm now facing something strange. The code below works fine with some sites and not with others (of course I changed the URL and the XPath ...). Note that no error is raised when I check whether curl_exec ran correctly, so the problem has to come from something that happens after it. Some of my questions are:

  • How can I check if a new DOMDocument was created correctly: if (??)
  • How can I check if a new DOMDocument was correctly populated with the HTML?
  • How can I check if a new DOMXPath object was created?

I hope I've made myself clear. Thanks in advance for your answers. Greetings. Mark

My php:

$target_url = "";
$userAgent = 'Googlebot/2.1 (';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);

if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
// @ hides the warnings libxml raises for malformed markup
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('somepath');

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link: $url";
}





2 answers

The problem has been resolved. The error came from Firebug giving the wrong XPath. Many thanks to MrCode for his support ...
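A wrong path like this can be spotted by checking what query() actually returns. The sketch below is only an illustration: describe_xpath_result() is a hypothetical helper name, and the sample markup in the usage note is made up. The thread does not say why the Firebug path was wrong; one common cause, assumed here, is that a browser shows a normalized DOM (for example with an inserted tbody) that differs from the raw HTML that DOMDocument parses.

```php
<?php
// Hypothetical helper: reports what an XPath expression matches in a string of HTML.
function describe_xpath_result(string $html, string $expression): string
{
    if ($html === '') {
        return 'HTML is empty';
    }

    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return 'HTML could not be loaded';
    }

    $xpath = new DOMXPath($dom);

    // query() returns false for a malformed expression and an empty
    // DOMNodeList when the expression is valid but matches nothing.
    $nodes = @$xpath->query($expression);
    if ($nodes === false) {
        return 'invalid XPath expression';
    }

    return $nodes->length . ' node(s) matched';
}
```

Calling it with the fetched $html and the path copied from the browser shows whether the path is malformed or simply matches nothing. If it matches nothing, shortening the path step by step (for instance dropping a tbody step) usually reveals where it diverges from the raw markup.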



Use try/catch to check whether the document object has been created, then check the return value of loadHTML() to determine whether the HTML has been loaded into the document. You can also use try/catch around the creation of the XPath object.

    try {
        $dom = new DOMDocument();

        // loadHTML() returns true on success and false on failure
        $loaded = $dom->loadHTML($html);

        if ($loaded) {
            // loaded OK
        } else {
            // could not load HTML
        }
    } catch (Exception $e) {
        // document could not be created, see $e->getMessage()
    }
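The return value of loadHTML() only says whether loading succeeded, not what the parser complained about. As a complement, here is a minimal sketch that also collects the libxml messages. The helper name load_html_with_errors() and the shape of the returned array are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: loads HTML and returns the document together with
// any messages libxml recorded while parsing.
function load_html_with_errors(string $html): array
{
    $dom = new DOMDocument();

    // Buffer libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    // An empty string is treated as a failed load rather than passed to loadHTML().
    $loaded = ($html !== '') && $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['loaded' => $loaded, 'dom' => $dom, 'errors' => $messages];
}
```

A call such as $result = load_html_with_errors($html); followed by print_r($result['errors']); lists the parser's complaints. Real-world pages often produce many harmless messages, so a non-empty list does not by itself mean the document is unusable.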



