Scrapy error - HTTP status code is not handled or not allowed

I am trying to start a spider, but I get this log:

2015-05-15 12:44:43+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-15 12:44:43+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-15 12:44:43+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-15 12:44:43+0100 [theverge] INFO: Spider opened
2015-05-15 12:44:43+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-15 12:44:43+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
        result = g.send(result)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 77, in start
        yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
        return signal.send_catch_log_deferred(*a, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
        *arguments, **named)
    --- <exception caught here> ---
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
        result = f(*args, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
        return receiver(*arguments, **named)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 47, in start_listening
        self.port = listen_tcp(self.portrange, self.host, self)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/reactor.py", line 14, in listen_tcp
        return reactor.listenTCP(x, factory, interface=host)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 495, in listenTCP
        p.startListening()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/tcp.py", line 984, in startListening
        raise CannotListenError(self.interface, self.port, le)
    twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [Errno 48] Address already in use.


This first error ("[Errno 48] Address already in use") recently started appearing in all my spiders, but the other spiders still work. Then comes:

2015-05-15 12:44:43+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6198
2015-05-15 12:44:44+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
2015-05-15 12:44:44+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
2015-05-15 12:44:44+0100 [theverge] INFO: Closing spider (finished)
2015-05-15 12:44:44+0100 [theverge] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 191,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 265,
     'downloader/response_count': 1,
     'downloader/response_status_count/403': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 15, 11, 44, 44, 136026),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 5, 15, 11, 44, 43, 829689)}
2015-05-15 12:44:44+0100 [theverge] INFO: Spider closed (finished)
2015-05-15 12:44:44+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.stop_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
        result = g.send(result)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 300, in _finish_stopping_engine
        yield self.signals.send_catch_log_deferred(signal=signals.engine_stopped)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
        return signal.send_catch_log_deferred(*a, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
        *arguments, **named)
    --- <exception caught here> ---
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
        result = f(*args, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
        return receiver(*arguments, **named)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 53, in stop_listening
        self.port.stopListening()
    exceptions.AttributeError: TelnetConsole instance has no attribute 'port'


Error "exceptions.AttributeError: TelnetConsole instance has no" port "attribute for me newbie ... Don't know what's going on as all my other spiders on other sites are working well.

Can anyone tell me how to fix it?

EDIT:

On reboot, these errors disappeared. But this spider still cannot crawl... Here are the logs now:

2015-05-15 15:46:55+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-15 15:46:55+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-15 15:46:55+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-15 15:46:55+0100 [theverge] INFO: Spider opened
2015-05-15 15:46:55+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-15 15:46:55+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-15 15:46:55+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-15 15:46:56+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
2015-05-15 15:46:56+0100 [theverge] INFO: Closing spider (finished)
2015-05-15 15:46:56+0100 [theverge] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 191,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 265,
     'downloader/response_count': 1,
     'downloader/response_status_count/403': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 15, 14, 46, 56, 8769),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 5, 15, 14, 46, 55, 673723)}
2015-05-15 15:46:56+0100 [theverge] INFO: Spider closed (finished)


this "2015-05-15 15: 46: 56 + 0100 [theverge] DEBUG: Ignore response <403 http://www.theverge.com/reviews >: HTTP status code is not processed or not allowed" is strange as I use download_delay = 2 and last week I was able to crawl this website without any problem ... What could happen?

2 answers


"Address already in use" means that something else is already listening on that port; most likely you are running another spider in parallel. The second error is a consequence of the first: since the telnet console never bound its port successfully, it has no port attribute to close on shutdown.

I would suggest rebooting to make sure the ports are freed, then running only one spider to see if it works. If it happens again, you can find out which application is occupying the port with netstat, lsof, or a similar tool.
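If you do want to run several spiders in parallel, here is a sketch of two settings.py tweaks that avoid the telnet-port clash (both setting names are from the Scrapy docs; the widened range is just an example):

    # settings.py
    # Option 1: disable the telnet console entirely if you never use it.
    TELNETCONSOLE_ENABLED = False

    # Option 2: widen the port range so each concurrent crawler can fall
    # back to a free port (the default range is [6023, 6073]).
    TELNETCONSOLE_PORT = [6023, 6123]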



Update: HTTP Error 403 Forbidden most likely means that the site has blocked you for sending too many requests. To work around this, use a proxy server. Check out Scrapy's HttpProxyMiddleware.
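A minimal sketch of sending requests through a proxy, assuming one is reachable at proxy.example.com:8080 (a placeholder address). HttpProxyMiddleware reads the proxy from request.meta; depending on your Scrapy version you may need to enable it explicitly in DOWNLOADER_MIDDLEWARES if no proxy environment variables are set:

    import scrapy

    class ThevergeSpider(scrapy.Spider):
        name = "theverge"

        def start_requests(self):
            # HttpProxyMiddleware picks up the 'proxy' key from meta.
            yield scrapy.Request(
                "http://www.theverge.com/reviews",
                meta={"proxy": "http://proxy.example.com:8080"},
            )

        def parse(self, response):
            self.log("Fetched %s (status %d)" % (response.url, response.status))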



Modifying settings.py in your project might help with the 403 error:



DOWNLOADER_MIDDLEWARES = {
    # Scrapy 0.24 path; in Scrapy >= 1.0 the same middleware lives at
    # 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
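A common variant of this fix is to keep the middleware and instead override the default user agent, which advertises Scrapy and is often blocked; a sketch, where the exact UA string is just an example:

    # settings.py -- a browser-like User-Agent string (illustrative;
    # any current browser UA should work).
    USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/42.0.2311.135 Safari/537.36')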

