Is it possible for Scrapy to use PhantomJS directly to render the source of a page it has already loaded?

In my CustomDownloaderMiddleware:

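    import requests
    from scrapy.http import HtmlResponse

    def process_request(self, request, spider):
        if spider.name == 'UrlSpider':
            # fetch the page with requests and hand Scrapy a ready-made response
            res = requests.get(request.url)
            return HtmlResponse(request.url, body=res.content, encoding='utf-8', request=request)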

      

I want to render response.body with PhantomJS in process_response. How should I do that?


1 answer


There is a Scrapy middleware that does exactly that: it runs your requests through PhantomJS, and your responses contain the rendered HTML.

You can find it here. It works well for me, although according to its author it is not well tested: https://github.com/brandicted/scrapy-webdriver

If you are not tied to PhantomJS, you can also take a look at https://github.com/scrapy-plugins/scrapy-splash, as it is much better supported (by the same people who develop the Scrapy core).

Update

If you only want to render some pages through PhantomJS, I see two possible ways to do it:

  1. It is most likely possible to use some JavaScript magic to inject the HTML from your response.body into PhantomJS and render it there. This would be exactly what you want, but it can be a little tricky to get right. (I have done some JavaScript magic with PhantomJS, and it is often not as easy as I hoped.)

  2. You can register a PhantomJS download handler alongside the standard downloader and load the pages you want rendered a second time, this time through the PhantomJS handler.

To do this, register the PhantomJS download handler in settings.py as follows:

    # note the 'js-' in front of http
    DOWNLOAD_HANDLERS = {
        'js-http': 'scrapy_webdriver.download.WebdriverDownloadHandler',
        'js-https': 'scrapy_webdriver.download.WebdriverDownloadHandler',
    }
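
Scrapy chooses a download handler by the URL scheme of each request, which is why the extra 'js-' schemes can act as a switch. The lookup can be pictured roughly like this. It is a minimal standalone sketch: `handler_path_for` is a hypothetical helper for illustration, not a Scrapy function, and in a real project Scrapy merges these entries with its built-in handlers for plain http and https.

```python
from urllib.parse import urlsplit

# Same mapping as in the settings.py snippet.
DOWNLOAD_HANDLERS = {
    "js-http": "scrapy_webdriver.download.WebdriverDownloadHandler",
    "js-https": "scrapy_webdriver.download.WebdriverDownloadHandler",
}


def handler_path_for(url, handlers=DOWNLOAD_HANDLERS):
    """Hypothetical helper: handler path registered for the URL's scheme, or None."""
    return handlers.get(urlsplit(url).scheme)
```

With only this mapping, a plain http URL finds no entry here, so in a real project it would fall through to the default handler.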

      

And then in your parse method:

    from scrapy import Request

    def parse(self, response):
        if should_be_rendered(response):
            # prefix only the leading scheme: http -> js-http, https -> js-https
            phantom_url = response.url.replace("http", "js-http", 1)
            # do the same request again but this time through the WebdriverDownloadHandler
            yield Request(phantom_url, ...)
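
If the URL may contain "http" somewhere in its path or query, a scheme-aware rewrite is safer than a string replace. A minimal sketch, assuming a hypothetical helper named `to_phantom_url` (not part of Scrapy or scrapy-webdriver) that only touches the scheme:

```python
from urllib.parse import urlsplit, urlunsplit


def to_phantom_url(url):
    """Hypothetical helper: prefix an http(s) scheme with 'js-', leave other URLs alone."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url
    # SplitResult is a namedtuple, so _replace swaps just the scheme field.
    return urlunsplit(parts._replace(scheme="js-" + parts.scheme))
```

In parse you would then build the second request from `to_phantom_url(response.url)` instead of the string replace.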

      
