Scrapy: using spider_idle to restart a scrape
I have a scrape set up in Scrapy that targets 1M unique URLs in a numerical sequence, for example: http://www.foo.com/PIN=000000000001
I keep the PINs in a database. Instead of loading 1M PINs into memory and generating 1M start_urls, I use the start_requests() method to query the database for 5000 PINs at a time. Once those 5000 unique URLs have been scraped, I want to restart the scrape and keep doing this until all 1M URLs are done. In the Scrapy user group, they recommended hooking the spider_idle signal to resume scraping. I have it wired up correctly in the code below, but I cannot find the right code to restart the scrape. See below:
from scrapy import Request, Spider, signals
from scrapy.xlib.pydispatch import dispatcher


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(Foo, self).__init__(*args, **kwargs)
        # connect the handler to the spider_idle signal
        dispatcher.connect(self.spider_idle, signals.spider_idle)

    def spider_idle(self, spider):
        print 'idle function called'  # this prints correctly, so I know the handler is being called
        self.start_requests()  # this does not restart the query

    def start_requests(self):
        data = self.coll.find({'status': 'unscraped'}).limit(5000)
        for row in data:
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            yield Request(url, meta={'pin': pin})
What code do I need to restart the scrape?
Instead of restarting the spider, I would keep querying the database for unscraped items until there is nothing left:
from scrapy import Request, Spider


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def start_requests(self):
        while True:
            # materialize the batch; a bare pymongo cursor cannot be used in a truth test
            data = list(self.coll.find({'status': 'unscraped'}).limit(5000))
            if not data:
                break
            for row in data:
                pin = row['pin']
                url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                yield Request(url, meta={'pin': pin})
You will probably need to implement real pagination over the collection, with limits and offsets, so that each batch picks up where the previous one left off.
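As a rough illustration of that pagination idea, here is a minimal sketch, assuming self.coll is a pymongo collection set up elsewhere on the spider. The skip() call, the offset bookkeeping, and the FooPaginated/batch_size names are my own illustrative additions, not part of the original answer:

from scrapy import Request, Spider


class FooPaginated(Spider):
    # hypothetical variant of the spider above that pages with skip()/limit()
    name = 'foo_paginated'
    allowed_domains = ['foo.com']

    batch_size = 5000  # assumed batch size, matching the 5000-PIN batches in the question

    def start_requests(self):
        offset = 0
        while True:
            # page through the collection instead of re-running the same query;
            # assumes self.coll is a pymongo collection assigned elsewhere
            batch = list(self.coll.find({'status': 'unscraped'})
                                  .skip(offset)
                                  .limit(self.batch_size))
            if not batch:
                break
            offset += len(batch)
            for row in batch:
                pin = row['pin']
                url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                yield Request(url, meta={'pin': pin})

Note that if a pipeline flips status to 'scraped' while the crawl is running, a skip()-based offset can jump over unscraped documents, so paginating on _id (querying for _id greater than the last one seen) is often the safer choice with MongoDB.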