Scrapy stops crawling after several pages
I'm just picking up the basics of Scrapy and web crawlers, so I would really appreciate your input. I've built a simple crawler with Scrapy by following a tutorial.
It works great, but it doesn't crawl all the pages it should.
My spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from fraist.items import FraistItem
import re

class fraistspider(BaseSpider):
    name = "fraistspider"
    allowed_domain = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//div[@class='pagination']/a/@href").extract()

        # We store already-crawled links in this list.
        # (Note: it is re-created on every call to parse(), so it only dedupes
        # links found within a single page; cross-page duplicates are caught
        # by Scrapy's built-in dupefilter.)
        crawledLinks = []

        # Pattern to check for a proper absolute link
        linkPattern = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # If it is a proper link and has not been seen yet, yield a Request for it
            if linkPattern.match(link) and not link in crawledLinks:
                crawledLinks.append(link)
                yield Request(link, self.parse)

        posts = hxs.select("//article[@class='content-summary']")
        items = []
        for post in posts:
            item = FraistItem()
            item["title"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.select("div[@class='summary']/p/text()").extract()
            items.append(item)
        for item in items:
            yield item
And the result:
'title': [u'Design a poster in the style of Saul Bass']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Scraped from <200 http://nnbdesigner.wpengine.com/designer-blog/>
    {'content': [u'Helping a company come up with a branding strategy can be exciting\xa0and intimidating, all at once. It gives a designer the opportunity to make a great visual impact with a brand, but requires skills in logo, print and digital design. If you\u2019ve been hesitating to join a 99designs Brand Identity Pack contest, here are a... '],
     'link': [u'http://99designs.com/designer-blog/2015/05/07/tips-brand-identity-pack-design-success/'],
     'title': [u'99designs\u2019 tips for a successful Brand Identity Pack design']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/10/>
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/11/>
2015-05-20 16:22:41+0100 [fraistspider] INFO: Closing spider (finished)
2015-05-20 16:22:41+0100 [fraistspider] INFO: Stored csv feed (100 items) in: data.csv
2015-05-20 16:22:41+0100 [fraistspider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 4425,
     'downloader/request_count': 16,
     'downloader/request_method_count/GET': 16,
     'downloader/response_bytes': 126915,
     'downloader/response_count': 16,
     'downloader/response_status_count/200': 11,
     'downloader/response_status_count/301': 5,
     'dupefilter/filtered': 41,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 20, 15, 22, 41, 738000),
     'item_scraped_count': 100,
     'log_count/DEBUG': 119,
     'log_count/INFO': 8,
     'request_depth_max': 5,
     'response_received_count': 11,
     'scheduler/dequeued': 16,
     'scheduler/dequeued/memory': 16,
     'scheduler/enqueued': 16,
     'scheduler/enqueued/memory': 16,
     'start_time': datetime.datetime(2015, 5, 20, 15, 22, 40, 718000)}
2015-05-20 16:22:41+0100 [fraistspider] INFO: Spider closed (finished)
As you can see, 'item_scraped_count' is 100, although it should be much higher: there are 122 pages with 10 articles per page.
From the output I can also see the 301 redirects, but I don't understand why they would cause a problem. I tried a different approach and rewrote my spider code, but it stops again after roughly the same number of entries.
Any help would be much appreciated. Thanks!
It looks like you are hitting the default limit of 100 items defined at http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items.
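If you want to rule that setting out, you can override it in your project's settings.py. This is only a sketch; the assumption is that this setting is what caps your run, and the new value is arbitrary:

# settings.py
# CONCURRENT_ITEMS defaults to 100; raise it and see whether the cap moves.
CONCURRENT_ITEMS = 500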
In any case, I would use a CrawlSpider to traverse multiple pages: you define a rule that matches the 99designs.com pagination links and slightly modify your parse function to handle the items (see the adapted sketch after the sample below).
Sample code, copy-pasted from the Scrapy docs:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
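Adapted to your spider it could look roughly like this. Again, only a sketch: the XPaths are copied from your parse method, the class and callback names are made up, and the /designer-blog/page/N/ pagination pattern is an assumption based on the URLs in your log:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from fraist.items import FraistItem

class FraistCrawlSpider(CrawlSpider):
    name = "fraistcrawlspider"
    allowed_domains = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    rules = (
        # Follow every pagination link and send each page to parse_page.
        # The URL pattern is an assumption based on the log output above.
        Rule(LinkExtractor(allow=(r'/designer-blog/page/\d+/',)),
             callback='parse_page', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not pass the start page itself through the rule
        # callbacks, so scrape its articles here too.
        return self.parse_page(response)

    def parse_page(self, response):
        # Same XPaths as in your original parse method.
        for post in response.xpath("//article[@class='content-summary']"):
            item = FraistItem()
            item["title"] = post.xpath("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.xpath("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.xpath("div[@class='summary']/p/text()").extract()
            yield item

Note that follow=True is set explicitly: once a Rule has a callback, following links from the matched pages is off by default, and the pagination crawl would otherwise stop after the first hop.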
Edit: I just found this blog post which contains a helpful example.