Proxies with Scrapy-Splash

Question

I am trying to get proxies to work with my local Splash instance. I've read several docs but haven't found any working examples. It was pointed out to me that the cause is https://github.com/scrapy-plugins/scrapy-splash/issues/107. I no longer get that traceback, but I still can't use Splash with a proxy; the new error message is below the code. None of my requests even make it to Splash. Thanks in advance to anyone who can help me solve this.

  def parse_json(self, response):
    json_data = response.body
    load = json.loads(json_data.decode('utf-8'))
    dump = json.dumps(load, sort_keys=True, indent=2)
    LUA_SOURCE = """
    function main(splash)
        local host = "proxy.crawlera.com"
        local port = 8010
        local user = "APIKEY"
        local password = ""
        local session_header = "X-Crawlera-Session"
        local session_id = "create"

        splash:on_request(function (request)
            request:set_header("X-Crawlera-UA", "desktop")
            request:set_header(session_header, session_id)
            request:set_proxy{host, port, username=user, password=password}
        end)

        splash:on_response_headers(function (response)
            if response.headers[session_header] ~= nil then
                session_id = response.headers[session_header]
            end
        end)

        splash:go(splash.args.url)
        return splash:html()
    end
    """
    for topic in load['d']['blogtopics']:
        yield SplashRequest(
            topic['Uri'],
            self.parse_blog,
            endpoint='execute',
            args={'wait': 3, 'lua_source': LUA_SOURCE},
        )


2017-03-29 09:26:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET http://community.martindale.com/legal-blogs/Practice_Areas/b/corporate__securities_law/archive/2011/08/11/sec-adopts-new-rules-replacing-credit-ratings-as-a-criterion-for-the-use-of-short-form-shelf-registration.aspx via http://localhost:8050/execute> (referer: None)

python web-scraping scrapy scrapy-splash splash-js-render

eusid 29 Mar 17 at 10:01

1 answer

eusid · Accepted Answer · 2017-03-30T03:02:07+0000

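The problem comes from the Crawlera middleware. It has no special handling for SplashRequest, so it tries to send the request for the local Splash instance (http://localhost:8050) through the Crawlera proxy, which cannot reach localhost. That would explain the 503 and why none of the requests made it to Splash.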
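
One way to work around this might be to tell the Crawlera middleware to leave Splash requests alone, so they reach localhost directly and only the Lua script talks to the proxy. The sketch below is not a confirmed fix. It assumes the scrapy-crawlera middleware skips any request whose meta contains dont_proxy=True, and that this meta survives when scrapy-splash rewrites the request to point at Splash. The helper name splash_request_kwargs is made up for illustration and is not part of either library.

```python
# Hypothetical helper: builds the keyword arguments for a SplashRequest
# that should bypass the Crawlera downloader middleware.
def splash_request_kwargs(lua_source, wait=3):
    return {
        "endpoint": "execute",
        "args": {"lua_source": lua_source, "wait": wait},
        # Assumption: scrapy-crawlera ignores requests flagged with
        # dont_proxy, so the call to http://localhost:8050 is sent directly.
        "meta": {"dont_proxy": True},
    }


kwargs = splash_request_kwargs("function main(splash) return splash:html() end")
# In the spider this would be used as:
#     yield SplashRequest(topic_url, self.parse_blog, **kwargs)
```

If the middleware is not needed for any other request in the spider, disabling it entirely for that spider should have the same effect, since the Lua script already sets the proxy on each request that Splash makes.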
