How to get bad URLs with Scrapy
I have a few custom downloader middlewares to catch the failed URLs:
DOWNLOADER_MIDDLEWARES = {
    'soufang.misc.middleware.CustomRecordMiddleware': 860,
    'soufang.misc.middleware.CustomFailresponseMiddleware': 298,
    'soufang.misc.middleware.CustomHttpProxyMiddleware': 297,
}
CustomRecordMiddleware sits very close to the downloader, so it can catch all the exceptions that are counted by the default DownloaderStats middleware. CustomHttpProxyMiddleware and CustomFailresponseMiddleware catch the URLs and exceptions that still fail after the retries.
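For reference, the relevant built-in priorities from Scrapy's DOWNLOADER_MIDDLEWARES_BASE (the module paths below are the pre-1.0 "contrib" ones my project uses; the exact paths depend on the Scrapy version, but the numbers should be the same):

# Built-in priorities relevant to the ordering above (excerpt for illustration):
DOWNLOADER_MIDDLEWARES_BASE_EXCERPT = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
}
# process_exception() and process_response() run from high priority to low, so
# CustomRecordMiddleware (860) sees every download exception before RetryMiddleware
# (500) gets a chance to retry it, while CustomFailresponseMiddleware (298) and
# CustomHttpProxyMiddleware (297) only see what is left after the retries.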
Below is my middleware.py:
from agents import AGENTS
from usefulproxy350 import PROXIES
from scrapy import log
import random


class CustomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent for every request.
        agent = random.choice(AGENTS)
        request.headers['User-Agent'] = agent


class CustomHttpProxyMiddleware(object):
    def process_request(self, request, spider):
        # Randomize the User-Agent and attach a random proxy.
        agent = random.choice(AGENTS)
        request.headers['User-Agent'] = agent
        p = random.choice(PROXIES)
        try:
            request.meta['proxy'] = "http://%s" % p
        except Exception as e:
            log.msg("Exception %s" % e, level=log.CRITICAL)

    def process_exception(self, request, exception, spider):
        # Record URLs that still raise an exception after the retries.
        url = request.url
        proxy = request.meta.get('proxy', '')
        myfile = open('outurl_excep.txt', 'a')
        myfile.write(url + '\n')
        myfile.write(proxy + '\n')
        myfile.close()


class CustomFailresponseMiddleware(object):
    def process_response(self, request, response, spider):
        try:
            # Treat non-200 responses (or responses with no headers at all)
            # as failures: record the URL and reschedule the request.
            if response.status != 200 or len(response.headers) == 0:
                myfile = open('outurl_respo.txt', 'a')
                myfile.write(response.url + '\n')
                myfile.close()
                return request
            return response
        except Exception as e:
            log.msg("Response Exception %s" % e)


class CustomRecordMiddleware(object):
    def process_exception(self, request, exception, spider):
        # Sits right next to the downloader, so every download exception passes
        # through here before the retry middleware handles it.
        url = request.url
        proxy = request.meta.get('proxy', '')
        myfile = open('outurl_record.txt', 'a')
        myfile.write(url + '\n')
        myfile.write(proxy + '\n')
        myfile.close()
        log.msg('Fail to request url %s with exception %s' % (url, str(exception)))
There still seem to be some bad URLs that I haven't caught: the crawl should go 51 pages deep, but it seems to stop after 24 pages.
Below are the statistics:
2015-05-27 13:04:15+0800 [soufang_redis] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 55,
'downloader/exception_type_count/twisted.internet.error.ConnectError': 6,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 18,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 9,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 21,
'downloader/request_bytes': 230985,
'downloader/request_count': 582,
'downloader/request_method_count/GET': 582,
'downloader/response_bytes': 8174486,
'downloader/response_count': 527,
'downloader/response_status_count/200': 505,
'downloader/response_status_count/400': 1,
'downloader/response_status_count/404': 4,
'downloader/response_status_count/502': 10,
'downloader/response_status_count/503': 7,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2015, 5, 27, 5, 4, 15, 945815),
'item_dropped_count': 5,
'item_dropped_reasons_count/DropItem': 5,
'item_scraped_count': 475,
'log_count/INFO': 82,
'log_count/WARNING': 5,
'request_depth_max': 24,
'response_received_count': 505,
'scheduler/dequeued/redis': 582,
'scheduler/enqueued/redis': 582,
'start_time': datetime.datetime(2015, 5, 27, 4, 47, 13, 889437)}
I checked my outurl_record.txt and the number of exceptions recorded is 55, which is exactly the same as downloader/exception_count. request_depth_max is only 24 (it should have been 51), but I haven't found anything about page 25 in outurl_record.txt, and there is nothing about it in outurl_excep.txt or outurl_respo.txt either.
I've run the crawl several times; sometimes it gets through all the pages and sometimes it doesn't.
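One thing I still wonder about: if the spider callback never yields the request for page 25 at all (for example because the next-page link cannot be extracted from a mangled but 200 response, or because the scheduler filters the request as a duplicate), none of the middlewares above would ever see it, so nothing would be written to the files. A minimal logging sketch for that case, with placeholder names and a placeholder XPath rather than my real spider code, would look like this:

# Sketch only: placeholder spider, not the real soufang_redis spider.
from urlparse import urljoin          # Python 2, matching the rest of the project
from scrapy import log
from scrapy.http import Request
from scrapy.spider import Spider      # the real spider is a scrapy_redis spider


class NextPageCheckSpider(Spider):
    name = 'nextpage_check'

    def parse(self, response):
        # Placeholder XPath; item extraction omitted.
        next_page = response.xpath('//a[@class="next"]/@href').extract()
        if not next_page:
            # The chain simply ends here: no exception is raised and no non-200
            # response is produced, so none of the middlewares above record anything.
            log.msg('No next-page link on %s (status %s, %d bytes)'
                    % (response.url, response.status, len(response.body)))
            return
        yield Request(urljoin(response.url, next_page[0]), callback=self.parse)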
What did I miss?