Running multiple crawlers in Scrapy sequentially

I'm trying to figure out a way to run multiple Scrapy crawlers at the same time without running into memory issues, etc.

For now, this is my run script:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from CSTest.spiders.CStest import MySpider
from scrapy.utils.project import get_project_settings


def setup_crawler(urls):
    spider = MySpider(urls=urls)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

filename_ = raw_input("Enter filename of seed list: ") # Enter name of list of seed sites separated by newline
log.start()


with open(filename_, "r") as starturlfile:
    urls = [x.strip() for x in starturlfile.readlines()] # Put urls into a list

for domain in urls:
    setup_crawler([domain]) # Setup a crawler for each domain in seedlist

reactor.run() # Start reactor

This loads a list of seed sites and runs a crawler for each site. It works great, but if the seed list has more than 100 sites it crashes, because it cannot handle more than 100 crawlers running at the same time.

To counter this, I would like to run 10 crawlers at a time, working through the seed list sequentially until all of the 100+ domains have been crawled.

I would need to somehow detect when a crawler has finished, so I can start the next one from where I left off in the list.

Is there a function to see how many active crawlers there are? Then I could just use a while loop like:

while True:
    if active_crawler_number < 10:
        start_the_next_crawler()
    time.sleep(60)

I know I can do something like:

self.crawler.signals.connect(self.next_site, signal=spider_closed)


But when I do that, it pauses the program until the crawler is finished, so I could only run 1 crawler at a time, not 10.

I'm not sure what the best way to solve this problem is, so if you have an idea of which approach I should take, please post an answer :)

If you need more information to help, just let me know and I will edit my post to add it.

1 answer


What's wrong with using subprocess and spider arguments? In your example, you are using one process for all your spiders, which is a waste if you have a multi-core processor.

The subprocess module also gives you a way to tell when a process has finished.
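
For example, here is a minimal sketch of that idea. The spider name "CStest", the seed file name and the urls argument are assumptions based on your post, so adjust them to your project:

import subprocess
import time

MAX_CONCURRENT = 10  # how many crawl processes to keep alive at once

with open("seedlist.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

running = []
for domain in domains:
    # Wait for a free slot; poll() returns None while a process is still running
    while len([p for p in running if p.poll() is None]) >= MAX_CONCURRENT:
        time.sleep(5)
    # Each crawl gets its own process, so its memory is released when it exits
    running.append(subprocess.Popen(
        ["scrapy", "crawl", "CStest", "-a", "urls=" + domain]))

# Wait for the last crawls to finish
for p in running:
    p.wait()

Since every domain runs in its own scrapy crawl process, the work also gets spread across your CPU cores.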



Another way to do it is using Scrapyd. The project is functional, but we are looking for new maintainers.
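
A rough sketch of what that would look like, assuming Scrapyd is running on localhost:6800 and your project and spider were deployed under the names "CSTest" and "CStest" (adjust these to whatever your deployment actually uses):

import requests

with open("seedlist.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    # Scrapyd queues each job and only runs as many at a time as its
    # max_proc / max_proc_per_cpu settings allow
    requests.post("http://localhost:6800/schedule.json", data={
        "project": "CSTest",  # assumed deployed project name
        "spider": "CStest",   # assumed spider name
        "urls": domain,       # extra fields are passed to the spider as arguments
    })

Scrapyd does the queueing and concurrency limiting for you, so you never have to count active crawlers yourself.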

And the third way I can think of is using Scrapy signals; I think engine_stopped is the one you are looking for.
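
As a sketch of that last option, here is roughly how you could chain crawlers inside one process, using the same old-style Crawler API as your run script. I've used spider_closed rather than engine_stopped because it fires once per crawler, and names like MAX_CONCURRENT and pending are my own:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from CSTest.spiders.CStest import MySpider

MAX_CONCURRENT = 10
pending = []   # domains still waiting for a free slot
running = [0]  # count of crawlers currently active

def spider_finished(spider, reason):
    # Runs inside the reactor whenever a crawler closes its spider
    running[0] -= 1
    if pending:
        start_crawler(pending.pop(0))
    elif running[0] == 0:
        reactor.stop()  # everything is done

def start_crawler(domain):
    spider = MySpider(urls=[domain])
    crawler = Crawler(get_project_settings())
    crawler.signals.connect(spider_finished, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    running[0] += 1

with open("seedlist.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

for domain in domains[:MAX_CONCURRENT]:
    start_crawler(domain)
pending.extend(domains[MAX_CONCURRENT:])

reactor.run()

Connecting the signal does not block anything by itself; the callback simply runs inside the reactor when a spider closes, so the other 9 crawlers keep going.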
