Using Goutte to read from a file / string

I am using Goutte to create a webscraper.

For development, I saved a .html document that I would like to go through (so I don't make requests to the website all the time). Here's what I have so far:

use Goutte\Client;

$client = new Client();
$html=file_get_contents('test.html');
$crawler = $client->request(null,null,[],[],[],$html);

      

As far as I know, this should make a request through Symfony\Component\BrowserKit and pass in the raw body data. Here is the error message I receive:

PHP Fatal error:  Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 7: Failed to connect to localhost port 80: Connection refused (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in C:\Users\Ally\Sites\scrape\vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.

      

If I were just using DomCrawler, creating a crawler from a string would be trivial (see http://symfony.com/doc/current/components/dom_crawler.html). I just don't know how to do the equivalent with Goutte.

Thanks in advance.



1 answer


The tools you have chosen make real HTTP connections, so they are not a good fit for what you want to do, at least not out of the box.

Option 1: Implement your own BrowserKit client

Goutte itself extends the BrowserKit Client and implements the HTTP requests using Guzzle.

All you need to do to implement your own client is extend Symfony\Component\BrowserKit\Client and provide a doRequest() method:

use Symfony\Component\BrowserKit\Client;
use Symfony\Component\BrowserKit\Request;
use Symfony\Component\BrowserKit\Response;

class FilesystemClient extends Client
{
    /**
     * @param Request $request The BrowserKit request to serve from disk
     *
     * @return Response The response built from the saved file
     */
    protected function doRequest($request)
    {
        $file = $this->getFilePath($request->getUri());

        if (!file_exists($file)) {
            return new Response('Page not found', 404, []);
        }

        $content = file_get_contents($file);

        return new Response($content, 200, []);
    }

    private function getFilePath($uri)
    {
        // Convert a URI to the path of your saved response.
        // It could be something like this (every character other than
        // letters, "_", "-" and "." becomes "_", digits included):
        return preg_replace('#[^a-zA-Z_\-\.]#', '_', $uri).'.html';
    }
}

      

$client = new FilesystemClient();
$crawler = $client->request('GET', '/test');

      

The client's request() method expects real URIs, so you need to implement your own logic to convert a URI to a file system location.
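To see what that conversion produces, here is a small standalone sketch of the mapping. The function names uriToFilePath() and uriToFilePathKeepingDigits() are made up for illustration, and the second pattern is my own assumed variant, not part of the class above. BrowserKit should resolve a relative URI such as '/test' against http://localhost by default, which is why the example URIs start with that host.

```php
<?php
// Hypothetical standalone copy of getFilePath() from the class above.
function uriToFilePath(string $uri): string
{
    return preg_replace('#[^a-zA-Z_\-\.]#', '_', $uri).'.html';
}

// Assumed variant that also keeps digits, so numbered pages stay distinct.
function uriToFilePathKeepingDigits(string $uri): string
{
    return preg_replace('#[^a-zA-Z0-9_\-\.]#', '_', $uri).'.html';
}

echo uriToFilePath('http://localhost/test'), PHP_EOL;
// http___localhost_test.html

echo uriToFilePath('http://localhost/page/2'), PHP_EOL;
// http___localhost_page__.html (the "2" is lost)

echo uriToFilePathKeepingDigits('http://localhost/page/2'), PHP_EOL;
// http___localhost_page_2.html
```

With the original pattern, /page/2 and /page/3 end up in the same file, so you may want to allow digits if the site you saved uses numbered URLs.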



Have a look at the Goutte Client for inspiration.

Option 2: Implement a custom Guzzle handler

Since Goutte uses Guzzle, you can provide your own Guzzle handler that loads responses from files instead of making actual HTTP requests. See the Guzzle documentation on handlers and middleware.

If you are only after caching responses so that you make fewer HTTP requests, Guzzle already has support for that.
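A rough sketch of such a handler might look like the following. It assumes a Guzzle-based Goutte version (one that still has setClient()) installed via Composer, and a fixtures/ directory of saved pages; the directory name and the $fileHandler variable are made up for illustration. The handler is passed to Guzzle directly, without a HandlerStack, so no Guzzle middleware (redirects, cookies, HTTP error exceptions) runs.

```php
<?php
// Sketch only: assumes a Guzzle-based Goutte installed via Composer.
require 'vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Promise\FulfilledPromise;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Message\RequestInterface;

// Hypothetical handler: serves saved pages from ./fixtures instead of the network.
$fileHandler = function (RequestInterface $request, array $options) {
    // URI-to-file mapping in the spirit of Option 1 (an assumption, adapt as needed).
    $name = preg_replace('#[^a-zA-Z0-9_\-\.]#', '_', (string) $request->getUri());
    $file = __DIR__.'/fixtures/'.$name.'.html';

    $response = is_file($file)
        ? new Response(200, ['Content-Type' => 'text/html'], file_get_contents($file))
        : new Response(404, [], 'Page not found');

    // A Guzzle handler returns a promise that resolves to a PSR-7 response.
    return new FulfilledPromise($response);
};

$client = new Client();
$client->setClient(new GuzzleClient(['handler' => $fileHandler]));

// Served from fixtures/http___localhost_test.html if that file exists.
$crawler = $client->request('GET', 'http://localhost/test');
```

Compared with Option 1 you keep using Goutte\Client unchanged, and only the transport underneath it is swapped out.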

Option 3: Use DomCrawler directly

use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler(file_get_contents('test.html'));

      

The only drawback is that you will lose some of the convenient BrowserKit client methods, like click() or selectLink().
