Scrapy: restarting the scrape from the spider_idle signal

I have a Scrapy scrape set up that targets 1M unique URLs in numerical sequence. For example: http://www.foo.com/PIN=000000000001

I keep the PINs in a database. Instead of loading all 1M PINs into memory and generating 1M start_urls, I use the start_requests() method to query the DB for 5000 PINs at a time. After those 5000 URLs are finished, I want to restart the scrape and keep going until all 1M URLs have been scraped. On the Scrapy user group, they recommended using the spider_idle signal to resume scraping. I have wired up the signal as shown below, but I cannot find the correct code to restart the scrape. See below:

from scrapy import Spider, Request, signals
from scrapy.xlib.pydispatch import dispatcher


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(Foo, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_idle, signals.spider_idle)

    def spider_idle(self, spider):
        print 'idle function called'  # this prints, so I know the handler is being called
        self.start_requests()  # this does not restart the scrape

    def start_requests(self):
        # self.coll is a MongoDB collection (set up elsewhere) that holds the PINs
        data = self.coll.find({'status': 'unscraped'}).limit(5000)

        for row in data:
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            yield Request(url, meta={'pin': pin})

What code do I need to restart the scrape?

1 answer


Instead of restarting the spider, I would query the database for unscraped items until there is nothing left:

from scrapy import Spider, Request


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def start_requests(self):
        while True:
            # materialize the cursor so the emptiness check below works
            # (a pymongo cursor is never falsy, even when it matches nothing)
            data = list(self.coll.find({'status': 'unscraped'}).limit(5000))

            if not data:
                break

            for row in data:
                pin = row['pin']
                url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                yield Request(url, meta={'pin': pin})
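
Note that the re-query only shrinks if documents are flipped out of 'unscraped' as they are processed; otherwise the same 5000 rows come back on every pass. Below is a minimal sketch of doing that in the parse callback, assuming self.coll is a pymongo 3 collection and reusing the field names from the question:

class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    # start_requests() exactly as in the answer above

    def parse(self, response):
        pin = response.meta['pin']
        # ... extract and yield whatever you need from the response ...
        # then mark the PIN as done so the next find({'status': 'unscraped'})
        # query no longer returns it
        self.coll.update_one({'pin': pin}, {'$set': {'status': 'scraped'}})

Scrapy consumes the start_requests() generator lazily and its duplicate filter drops URLs that get scheduled twice, so a little overlap between batches is harmless.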

You will probably also need to implement proper pagination over the collection, with limits and offsets.
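
A rough sketch of that pagination, again assuming self.coll is a pymongo collection; the batch size, field names and URL pattern are simply the ones from the question. It pages over the whole collection in a stable order rather than over the shrinking 'unscraped' filter, since skipping over a filter whose matches change mid-crawl would jump past documents. The method is a drop-in replacement for start_requests() in the spider above.

    def start_requests(self):
        batch_size = 5000
        offset = 0
        while True:
            # fetch the next page in a stable order (sorted by pin)
            batch = list(self.coll.find()
                                  .sort('pin', 1)
                                  .skip(offset)
                                  .limit(batch_size))
            if not batch:
                break

            for row in batch:
                pin = row['pin']
                url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                yield Request(url, meta={'pin': pin})

            offset += batch_size

For a collection of 1M documents, range-based paging on an indexed field (e.g. {'pin': {'$gt': last_pin}}) will scale better than large skip() offsets, but the idea is the same.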
