Scrapy spider is not interrupted using the CloseSpider extension
I have set up a Scrapy spider which parses an XML feed, processing about 20,000 records.
For development purposes, I would like to limit the number of items processed. From reading the Scrapy docs I determined that I need to use the CloseSpider extension.
I followed the guide to enable this - in my spider config I have the following:
CLOSESPIDER_ITEMCOUNT = 1

EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}
However, my spider never stops. I know that CONCURRENT_REQUESTS affects when the spider actually finishes (since it will complete every request already in flight), but this is left at the default of 16, and the spider still goes on to process all the items...
I also tried the CLOSESPIDER_TIMEOUT setting, but it similarly has no effect.
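For reference, a minimal sketch of the relevant settings.py lines as I tried them (the 60-second timeout value is just an example, not a recommended figure):

# settings.py (sketch)
CLOSESPIDER_ITEMCOUNT = 1   # ask Scrapy to close the spider after 1 scraped item
CLOSESPIDER_TIMEOUT = 60    # ask Scrapy to close the spider after 60 seconds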
Here is some debug output from running the spider:
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened
As you can see, the CloseSpider extension is enabled and the CLOSESPIDER_ITEMCOUNT setting is applied.
Any ideas why this isn't working?
I came up with a solution, helped by the other answer here as well as my own research. It does have some unexplained behaviour that I will cover below (comments appreciated).
In my spider file myspider_spider.py I have the following (edited for brevity):
import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem


class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = {"www.mysource.com"}
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the project settings off the crawler so they are available in parse_node()
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        # Close the spider manually once the configured item count is reached
        if (self.settings['CLOSESPIDER_ITEMCOUNT']
                and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1
            id = node.xpath('id/text()').extract()
            title = node.xpath('title/text()').extract()
            item = MySpiderItem()
            item['id'] = id
            item['title'] = title
            return item
This works: if I set CLOSESPIDER_ITEMCOUNT to 10, it finishes after 10 items have been processed (so in this respect it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
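For completeness, the limit can also be overridden per run with Scrapy's standard -s flag rather than in settings.py (the value 10 here is just an example):

scrapy crawl myspiders -s CLOSESPIDER_ITEMCOUNT=10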
I commented this out in my settings.py:
#EXTENSIONS = {
# 'scrapy.extensions.closespider.CloseSpider': 500,
#}
So it now relies only on the CloseSpider exception. However, the log shows the following:
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 8599860,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
'item_scraped_count': 10,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
The main things to highlight are the first INFO line and the finish_reason: the message displayed in that INFO line is not the one I set when raising the CloseSpider exception. It implies that the CloseSpider extension is what stops the spider, but I know it isn't? Very confusing.
You can also use the CloseSpider exception to limit the number of items.
Just note that the CloseSpider exception is only supported in a spider callback, as you can see in the documentation:
"This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:"
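As a minimal sketch, assuming a hypothetical spider name, feed URL and item limit (all placeholders for illustration), raising the exception from a callback looks like this:

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ["https://example.com/feed.xml"]
    item_count = 0

    def parse(self, response):
        # parse() is a spider callback, so raising CloseSpider here is supported
        for node in response.xpath("//item"):
            self.item_count += 1
            if self.item_count > 10:
                raise CloseSpider("item limit reached")
            yield {"title": node.xpath("title/text()").extract_first()}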