Scrapy spider is not interrupted using the CloseSpider extension
I have set up a Scrapy spider which parses an XML feed, processing about 20,000 records.
For development purposes, I would like to limit the number of items processed. From reading the Scrapy docs I determined that I need to use the CloseSpider extension.
I followed the guide to enable this - in my spider config I have the following:
CLOSESPIDER_ITEMCOUNT = 1

EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}
However, my spider never stops. I know that CONCURRENT_REQUESTS affects when the spider actually finishes (since it will complete every request already in flight), but this is left at the default of 16, and the spider still goes on to process all the items...
I also tried the CLOSESPIDER_TIMEOUT setting, but it similarly has no effect.
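For reference, a minimal sketch of the relevant settings.py lines as I tried them (the 60-second timeout value is just an example, not a recommended figure):

# settings.py (sketch)
CLOSESPIDER_ITEMCOUNT = 1   # ask Scrapy to close the spider after 1 scraped item
CLOSESPIDER_TIMEOUT = 60    # ask Scrapy to close the spider after 60 seconds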
Here is some debug output from running the spider:
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened
As you can see, the CloseSpider extension is enabled and the CLOSESPIDER_ITEMCOUNT setting is applied.
Any ideas why this isn't working?
I came up with a solution, helped by the other answer here as well as my own research. It does have some unexplained behaviour that I will cover below (comments appreciated).
In my spider file myspider_spider.py I have the following (edited for brevity):
import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem


class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = {"www.mysource.com"}
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the project settings off the crawler so they are available in parse_node()
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        # Close the spider manually once the configured item count is reached
        if (self.settings['CLOSESPIDER_ITEMCOUNT']
                and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1
            id = node.xpath('id/text()').extract()
            title = node.xpath('title/text()').extract()
            item = MySpiderItem()
            item['id'] = id
            item['title'] = title
            return item
This works: if I set CLOSESPIDER_ITEMCOUNT to 10, it finishes after 10 items have been processed (so in this respect it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
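For completeness, the limit can also be overridden per run with Scrapy's standard -s flag rather than in settings.py (the value 10 here is just an example):

scrapy crawl myspiders -s CLOSESPIDER_ITEMCOUNT=10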
I commented this out in my settings.py:
#EXTENSIONS = {
# 'scrapy.extensions.closespider.CloseSpider': 500,
#}
So it now relies only on the CloseSpider exception. However, the log shows the following:
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 8599860,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
'item_scraped_count': 10,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
The main things to highlight are the first INFO line and the finish_reason: the message displayed in that INFO line is not the one I set when raising the CloseSpider exception. It implies that the CloseSpider extension is what stops the spider, but I know it isn't? Very confusing.
You can also use the CloseSpider exception to limit the number of items.
Just note that the CloseSpider exception is only supported in a spider callback, as you can see in the documentation:
"This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:"
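As a minimal sketch, assuming a hypothetical spider name, feed URL and item limit (all placeholders for illustration), raising the exception from a callback looks like this:

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ["https://example.com/feed.xml"]
    item_count = 0

    def parse(self, response):
        # parse() is a spider callback, so raising CloseSpider here is supported
        for node in response.xpath("//item"):
            self.item_count += 1
            if self.item_count > 10:
                raise CloseSpider("item limit reached")
            yield {"title": node.xpath("title/text()").extract_first()}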