How to "pause" a spider in Scrapy?

Question

I am using Tor (via Privoxy) for a scraping project and would like to write a Scrapy extension (see https://doc.scrapy.org/en/latest/topics/extensions.html ) that requests a new identity (see https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor ) once a certain number of items have been scraped.

However, the identity change takes some time (a couple of seconds), during which I expect nothing can be scraped. So I would like the extension to "pause" the spider until the IP change is complete.

Is this possible? (I've read about solutions that use Ctrl+C and specify a JOBDIR, but that seems a little harsh, as I only want to pause the spider, not stop the whole engine.)

python scrapy

Kurt peek May 11 '17 at 15:59

1 answer

mizhgun · Accepted Answer · 2017-05-11T16:08:49+0000

The crawler engine has pause() and unpause() methods, so you can try something like this:

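class SomeExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this factory and passes in the running crawler.
        o = cls(...)  # replace ... with your extension's constructor arguments
        o.crawler = crawler
        return o

    def change_tor(self):
        # Stop the engine from scheduling new requests during the identity change.
        self.crawler.engine.pause()
        # some Python code that implements the identity-change logic
        self.crawler.engine.unpause()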
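
To connect this to the "every N items" requirement from the question, here is a minimal sketch of how the pieces might fit together. It is not a tested production extension. The class name TorIdentityRotator, the setting name TOR_ITEM_LIMIT, its default of 100, and the control port 9051 are all assumptions made for illustration. The Tor call follows the NEWNYM recipe from the stem FAQ linked in the question. The Scrapy and stem imports are done lazily so that the counting logic can be exercised without either library installed.

```python
class TorIdentityRotator:
    """Pause the engine every `item_limit` scraped items and renew the Tor identity."""

    def __init__(self, crawler, item_limit, renew_identity):
        self.crawler = crawler
        self.item_limit = item_limit
        self.renew_identity = renew_identity  # callable that performs the identity change
        self.items_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the class itself does not require Scrapy at import time.
        from scrapy import signals

        # TOR_ITEM_LIMIT is a hypothetical setting name chosen for this sketch.
        item_limit = crawler.settings.getint("TOR_ITEM_LIMIT", 100)
        ext = cls(crawler, item_limit, request_new_identity)
        # Count items by listening to Scrapy's item_scraped signal.
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        self.items_seen += 1
        if self.items_seen % self.item_limit == 0:
            self.change_tor()

    def change_tor(self):
        self.crawler.engine.pause()
        try:
            self.renew_identity()
        finally:
            # Always resume, even if the identity change raised an error.
            self.crawler.engine.unpause()


def request_new_identity(control_port=9051):
    """Ask Tor for a new identity via stem (port 9051 is an assumed default)."""
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
```

The extension would then be enabled through the EXTENSIONS setting, using whatever module path it lives under in your project. Two caveats: as far as I can tell, pause() only stops the engine from scheduling further requests, so downloads already in flight may still complete over the old identity; and any blocking wait placed between pause() and unpause() also blocks the Twisted reactor while it runs.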
