How can an infinitely scrolling webpage be scraped with PHP?

I would like to know how to scrape, in a loop (page 1, page 2, etc.), a webpage that uses infinite scrolling (like imgur).

I tried the code below, but it only returns the first page. How can I load the next page, given the infinite scroll pattern?

<?php
// Follow redirects manually when open_basedir/safe_mode forbid
// CURLOPT_FOLLOWLOCATION. (The function declaration was missing, and
// the safe_mode check had a misplaced parenthesis.)
function curl_exec_follow($ch, &$maxredirect = null) {
    $mr = $maxredirect === null ? 10 : intval($maxredirect);
    if (ini_get('open_basedir') == '' && ini_get('safe_mode') == 'Off') {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
        curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    } else {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

        if ($mr > 0) {
            $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $newurl = $original_url;
            $rch = curl_copy_handle($ch);

            curl_setopt($rch, CURLOPT_HEADER, true);
            curl_setopt($rch, CURLOPT_NOBODY, true);
            curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
            do {
                curl_setopt($rch, CURLOPT_URL, $newurl);
                $header = curl_exec($rch);
                if (curl_errno($rch)) {
                    $code = 0;
                } else {
                    $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
                    if ($code == 301 || $code == 302) {
                        preg_match('/Location:(.*?)\n/', $header, $matches);
                        $newurl = trim(array_pop($matches));

                        // if no scheme is present then the new url is a
                        // relative path and thus needs some extra care
                        if(!preg_match("/^https?:/i", $newurl)){
                            $newurl = $original_url . $newurl;
                        }
                    } else {
                        $code = 0;
                    }
                }
            } while ($code && --$mr);
            curl_close($rch);
            if (!$mr) {
                if ($maxredirect === null)
                    trigger_error('Too many redirects.', E_USER_WARNING);
                else
                    $maxredirect = 0;
                return false;
            }
            curl_setopt($ch, CURLOPT_URL, $newurl);
        }
    }
    return curl_exec($ch);
}

$ch = curl_init('http://www.imgur.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec_follow($ch);
curl_close($ch);

echo $data;
?>

      

2 answers


cURL works by fetching the source code of a web page. Your code only collects the HTML of the original page, which in imgur's case includes roughly 40 images plus the rest of the page layout.

That original source code doesn't change when you scroll down. What changes is the HTML inside your browser, which is updated via AJAX: the page you are looking at requests the content of the second page.

If you use FireBug (for Firefox) or the Google Chrome page inspector, you can watch these requests in the Net or Network tab (respectively). When you scroll down, the page makes another ~45 requests or so (mostly for images). You will also see that it requests this page:



http://imgur.com/gallery/hot/viral/day/page/0?scrolled&set=1

The JavaScript on the imgur home page appends that HTML to the bottom of the home page. You probably want to request this URL (or the API, as Chris suggests in the other answer) if you want to get a list of images. You can play with the number at the end of the URL to get more pages of images.
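In PHP, that AJAX endpoint can be fetched directly in a loop. A minimal sketch, assuming the `/page/N?scrolled&set=1` pattern observed in the Network tab still holds (imgur can change it at any time):

```php
<?php
// Fetch one "scroll page" of the imgur front page by incrementing the
// page number in the AJAX URL seen in the browser's Network tab. The URL
// pattern is an assumption taken from that observation and may change.
function fetch_scroll_page($page)
{
    $url = sprintf('http://imgur.com/gallery/hot/viral/day/page/%d?scrolled&set=1', $page);
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

// Concatenate the first three scroll pages, just as the browser would
// append them to the bottom of the document.
$all = '';
for ($page = 0; $page < 3; $page++) {
    $all .= fetch_scroll_page($page);
}
```

Each response is an HTML fragment, so you would still need to parse the image URLs out of it (e.g. with `DOMDocument`).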



Screen scraping is rarely the best approach, for exactly these reasons. Imgur offers an API that does what I assume you are trying to do, without resorting to hacky scraping.

If you are wedded to the idea of scraping, you will need to do some research. Instead of scraping just the main page, pay attention to the URL used by the AJAX request: you can call it directly and keep scraping subsequent pages of data. The specifics of that approach are beyond the scope of this answer, especially given that there is an established API.
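As a hedged sketch of the API route, the Imgur v3 gallery endpoint can be paged the same way; `YOUR_CLIENT_ID` is a placeholder for a key registered with Imgur, and the exact response fields should be checked against the current API documentation:

```php
<?php
// Request one page of the viral gallery from the Imgur API v3 instead of
// scraping HTML. YOUR_CLIENT_ID is a placeholder; register an application
// with Imgur to obtain a real one.
function imgur_gallery_page($page)
{
    $ch = curl_init("https://api.imgur.com/3/gallery/hot/viral/$page");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Authorization: Client-ID YOUR_CLIENT_ID'));
    $json = curl_exec($ch);
    curl_close($ch);
    return $json === false ? null : json_decode($json, true);
}

$result = imgur_gallery_page(0);
if ($result !== null && $result['success']) {
    foreach ($result['data'] as $item) {
        echo $item['link'], "\n"; // direct image (or album) URL
    }
}
```

The API returns structured JSON, so there is no fragile HTML parsing, and incrementing `$page` replaces the infinite-scroll trick entirely.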


