Scrapy spider doesn't follow links
I am writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell against http://www.nytimes.com, it successfully retrieves the list of article URLs with le.extract_links(response), but I cannot get the crawl command (scrapy crawl nyt -o out.json) to scrape anything other than the homepage. I am at my wit's end. Is it because the homepage does not yield any articles from the parse function? Any help is appreciated.
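For reference, this is roughly the shell session that works (the concrete date in the allow pattern is illustrative; le is a LinkExtractor built the same way as in the spider below):

$ scrapy shell http://www.nytimes.com
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=(r'/2015/01/01/[a-z]+/.*\.html',))  # date illustrative
>>> le.extract_links(response)  # returns a non-empty list of Link objects

Here is my spider: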
from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls

    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article
1 answer
I found the solution to my problem. I was doing two things wrong:
- I needed to subclass CrawlSpider, not Spider, if I wanted it to follow sublinks automatically.
- When using CrawlSpider, I needed to point the rule at a custom callback rather than overriding parse. According to the docs, overriding parse breaks CrawlSpider's functionality (see the sketch below).
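A minimal sketch of the corrected spider, keeping the question's old scrapy.contrib import paths; parse_article is just a name I picked for the callback:

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):  # 1. subclass CrawlSpider, not scrapy.Spider
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]

    rules = (
        # 2. route matched links to a custom callback; leave parse() alone
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        for story in response.xpath('//article[@id="story"]'):
            article = NewsArticle()  # one fresh item per story
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name').extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article

With the rule's callback renamed, scrapy crawl nyt -o out.json follows the homepage links and scrapes the article pages as expected.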