How to share an instance of an object among spiders running on scrapyd

I need to share a single object instance among the crawlers / spiders running on scrapyd. Ideally I would bind the object's methods to each spider's signals, something like

ext = CommonObject()
crawler.signals.connect(ext.onSpiderOpen, signal=signals.spider_opened)
crawler.signals.connect(ext.onSpiderClose, signal=signals.spider_closed)

etc..


where CommonObject is instantiated and initialized only once, and exposes its methods to all running crawler / spider processes (I don't mind using a singleton for this purpose).
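For what it's worth, within a single Python process the singleton part is straightforward; a minimal sketch (the `counter` attribute is just a placeholder for whatever state I need to share):

```python
class CommonObject:
    """Process-local singleton: every call to CommonObject() returns the same instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.counter = 0  # initialized exactly once
        return cls._instance
```

The catch, as I understand it, is that this only works within one process: since scrapyd launches each spider in its own process, each process would get its own "singleton", which is exactly the problem.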

Based on my research, I understand that I have two options:

  • Run all spiders / crawlers inside a single CrawlerProcess, where a single CommonObject instance is instantiated as well.
  • Run one spider / crawler per CrawlerProcess (scrapyd's default behavior), instantiate the CommonObject somewhere in the reactor, and access it remotely, perhaps with twisted.spread.pb.
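To make the first option concrete, here is roughly what I have in mind: one process, one CommonObject, its methods connected to each crawler's signals before the process starts. This is a sketch only, assuming a Scrapy project with two spiders whose names, `spider_one` and `spider_two`, are hypothetical:

```python
class CommonObject:
    """One instance shared by every spider in this CrawlerProcess."""

    def __init__(self):
        self.open_count = 0

    def on_spider_opened(self, spider):
        self.open_count += 1

    def on_spider_closed(self, spider, reason):
        self.open_count -= 1


def main():
    # Scrapy imports kept inside main() so the class above is importable standalone.
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    ext = CommonObject()  # single shared instance
    process = CrawlerProcess(get_project_settings())

    for name in ("spider_one", "spider_two"):  # hypothetical spider names
        crawler = process.create_crawler(name)
        crawler.signals.connect(ext.on_spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.on_spider_closed, signal=signals.spider_closed)
        process.crawl(crawler)

    process.start()  # blocks until all queued crawlers finish


if __name__ == "__main__":
    main()
```

My understanding is that the crawlers queued this way run concurrently in the same reactor, so the shared instance is safe as long as its methods don't block.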

Questions:

  • Does the first option carry a CPU penalty compared to letting scrapyd manage the processes (the second option)? Is CrawlerProcess capable of running multiple crawlers in parallel (not sequentially)? And how would you schedule additional spiders at runtime within the same CrawlerProcess? (I understand that CrawlerProcess.start() is blocking.)
  • I'm not advanced enough to implement the second option (and I'm not sure it is a viable option at all). Could anyone sketch an example implementation?
  • Or am I missing something, and is there another way to do this?