Scrapy: collect retry messages
Scrapy retries a failed request up to a maximum number of times, as described here . Once that limit is reached, I get an error similar to the following:
Gave up retrying <GET https:/foo/bar/123> (failed 3 times)
I believe the message is generated by the code here .
However, I want to do some post-processing on these give-ups. Specifically, I'm wondering if it is possible to:
- Extract the 123 (id) part of the URL and write those ids neatly to a separate file.
- Access the meta information in the original request. This documentation might be helpful.
You can subclass scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and override _retry() to do whatever you want with requests that are given up on.
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy import log


class CustomRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # do something with the request: inspect request.meta, look at request.url...
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
Then it is a matter of referencing this custom middleware component in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'myproject.middlewares.CustomRetryMiddleware': 500,
}
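
For the "do something with the request" part, here is a minimal sketch of what the give-up branch could call to pull the id out of the URL and log it together with the request's meta. The helper names extract_id and record_give_up, the failed_ids.jsonl file name, and the assumption that the id is the last path segment of the URL are all hypothetical and not part of Scrapy; adjust them to your URL scheme.

```python
import json
from urllib.parse import urlparse


def extract_id(url):
    """Return the last non-empty path segment of the URL (assumed to be the id)."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[-1] if segments else None


def record_give_up(url, meta, path="failed_ids.jsonl"):
    """Append one JSON line holding the id, the URL and the simple meta values."""
    # request.meta can hold arbitrary objects, so keep only values
    # that json.dumps can serialise without extra configuration.
    safe_meta = {
        key: value
        for key, value in meta.items()
        if isinstance(value, (str, int, float, bool, type(None)))
    }
    entry = {"id": extract_id(url), "url": url, "meta": safe_meta}
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

Inside the else branch of _retry() you would then call something like record_give_up(request.url, request.meta). Writing one JSON object per line keeps the file append-only, so a long crawl can add to it without rewriting earlier entries.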