Running multiple Scrapy spiders sequentially
I'm trying to figure out a way to run multiple Scrapy spiders at the same time without running into memory issues etc.
For now, this is my run script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from CSTest.spiders.CStest import MySpider
from scrapy.utils.project import get_project_settings
def setup_crawler(urls):
    spider = MySpider(urls=urls)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

filename_ = raw_input("Enter filename of seed list: ")  # Name of the list of seed sites, separated by newlines
log.start()
with open(filename_, "r") as starturlfile:
    urls = [x.strip() for x in starturlfile.readlines()]  # Put urls into a list

for domain in urls:
    setup_crawler([domain])  # Set up a crawler for each domain in the seed list

reactor.run()  # Start the reactor
This loads a list of seed sites and runs a crawler for each one. It works great, but if I give it a seed list of more than 100 sites it crashes, because it cannot handle more than 100 crawlers running at the same time.
To counter this, I would like to run 10 spiders at a time, working through the seed list until all 100+ domains have been crawled.
For that I would need some way to detect when a crawler has finished, so I can start the next one.
Is there a function to see how many crawlers are currently active? Then I could just put in a while loop like:
while True:
    if active_crawler_number < 10:
        start_the_next_crawler()
    time.sleep(60)
I know I can do something like:
self.crawler.signals.connect(self.next_site, signal=spider_closed)
But in doing so, the program pauses until the crawler has finished, so I could only ever run 1 crawler at a time, not 10.
I'm not sure what the best way to solve this is, so if you have an idea of which approach I should take, please post your answer :)
If you need more information to help, just let me know and I will edit my post to add it.
What's wrong with using subprocess and spider arguments? In your example, you are using one process for all your spiders, which is a waste if you have a multi-core processor.
The subprocess module also gives you a way to tell when a process has finished (for example Popen.poll() or Popen.wait()).
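For example, here is a minimal sketch of that idea. It assumes the spider is registered in the project under the name "cstest" (a placeholder) and that it can take its start URL as a urls spider argument via -a, which is an assumption based on the MySpider(urls=...) constructor in your script; poll() returns None while the child process is still running, so it can be used to keep at most 10 crawls going:

import subprocess
import time

MAX_CONCURRENT = 10

with open("seeds.txt") as f:  # hypothetical seed file, one domain per line
    pending = [line.strip() for line in f if line.strip()]

running = []
while pending or running:
    # Drop processes that have finished; poll() is None while a crawl is still running
    running = [p for p in running if p.poll() is None]
    # Top up to MAX_CONCURRENT parallel "scrapy crawl" processes
    while pending and len(running) < MAX_CONCURRENT:
        domain = pending.pop(0)
        running.append(subprocess.Popen(
            ["scrapy", "crawl", "cstest", "-a", "urls=%s" % domain]))
    time.sleep(5)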
Another way to do it is using Scrapyd. The project is functional, but we are looking for new maintainers.
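A rough sketch of the Scrapyd route, assuming Scrapyd is running locally on its default port 6800 and your project has been deployed as "CSTest" with a spider named "cstest" (both names are placeholders). Each call to schedule.json only queues a job; Scrapyd's max_proc setting controls how many run in parallel, so you never start 100 crawls at once:

import requests  # third-party HTTP client, used here for brevity

with open("seeds.txt") as f:  # hypothetical seed file, one domain per line
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    # Queue one job per domain; extra POST fields are passed to the spider as arguments
    requests.post("http://localhost:6800/schedule.json",
                  data={"project": "CSTest", "spider": "cstest", "urls": domain})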
And the third way I can think of is using Scrapy signals; I think engine_stopped is the one you are looking for.
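To make that concrete, here is a rough sketch built on the same (old) Crawler API as your run script; newer Scrapy versions would use CrawlerProcess/CrawlerRunner instead. Note that connect() does not block: the callback simply fires later inside the reactor when spider_closed is sent, so you can keep 10 crawlers running and start the next domain each time one closes.

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from CSTest.spiders.CStest import MySpider

MAX_CONCURRENT = 10
pending = []   # domains still waiting for a crawler
active = [0]   # number of crawlers currently running (a list so the callback can mutate it)

def start_crawler(domain):
    crawler = Crawler(get_project_settings())
    crawler.signals.connect(crawler_finished, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(MySpider(urls=[domain]))
    crawler.start()
    active[0] += 1

def crawler_finished(spider, reason):
    # Called inside the reactor each time one crawler closes
    active[0] -= 1
    if pending:
        start_crawler(pending.pop(0))
    elif active[0] == 0:
        reactor.stop()  # nothing running and nothing pending: we are done

filename_ = raw_input("Enter filename of seed list: ")
log.start()
with open(filename_, "r") as starturlfile:
    pending = [x.strip() for x in starturlfile.readlines()]

for _ in range(min(MAX_CONCURRENT, len(pending))):
    start_crawler(pending.pop(0))

reactor.run()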