Scrapy spider doesn't follow links

I am writing a scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell on http://www.nytimes.com, it successfully retrieves the list of article urls with le.extract_links(response), but I cannot get the crawl command (scrapy crawl nyt -o out.json) to scrape anything other than the homepage. I'm at my wit's end. Is it because the homepage does not yield any articles from the parse function? Any help is appreciated.
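
For reference, the shell session that does return the links looks roughly like this (a sketch from memory; le is just a local name for the extractor, and the regex mirrors the Rule in the spider below):

$ scrapy shell http://www.nytimes.com
>>> from datetime import date
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = date.today().strftime('%Y/%m/%d')
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> links = le.extract_links(response)    # returns a list of Link objects
>>> [l.url for l in links][:5]            # sample of the extracted article urls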

from datetime import date                                                       

import scrapy                                                                   
from scrapy.contrib.spiders import Rule                                         
from scrapy.contrib.linkextractors import LinkExtractor                         


from ..items import NewsArticle                                                 

with open('urls/debug/nyt.txt') as debug_urls:                                  
    debug_urls = debug_urls.readlines()                                         

with open('urls/release/nyt.txt') as release_urls:                              
    release_urls = release_urls.readlines() # ["http://www.nytimes.com"]                                 

today = date.today().strftime('%Y/%m/%d')                                       
print today                                                                     


class NytSpider(scrapy.Spider):                                                 
    name = "nyt"                                                                
    allowed_domains = ["nytimes.com"]                                           
    start_urls = release_urls                                                      
    rules = (                                                                      
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),          
                 callback='parse', follow=True),                                   
    )                                                                              

    def parse(self, response):                                                     
        article = NewsArticle()                                                                         
        for story in response.xpath('//article[@id="story"]'):                     
            article['url'] = response.url                                          
            article['title'] = story.xpath(                                        
                    '//h1[@id="story-heading"]/text()').extract()                  
            article['author'] = story.xpath(                                       
                    '//span[@class="byline-author"]/@data-byline-name'             
            ).extract()                                                         
            article['published'] = story.xpath(                                 
                    '//time[@class="dateline"]/@datetime').extract()            
            article['content'] = story.xpath(                                   
                    '//div[@id="story-body"]/p//text()').extract()              
            yield article  

      

1 answer


I found a solution to my problem. I was doing two things wrong:

  • I needed to subclass CrawlSpider, not Spider, if I wanted it to follow sublinks automatically.
  • When using CrawlSpider, I needed to give the Rule its own callback instead of overriding parse. According to the docs, overriding parse breaks CrawlSpider's functionality (see the sketch below).
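
Putting both fixes together, a minimal sketch of the corrected spider could look like the following. The callback name parse_article is my own choice, and start_urls is hard-coded to the homepage here instead of being read from urls/release/nyt.txt; the XPaths are unchanged from the question.

from datetime import date

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):  # subclass CrawlSpider so the rules are applied
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        # follow today's article links and hand each response to parse_article
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):  # renamed callback; do not override parse
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                    '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                    '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                    '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                    '//div[@id="story-body"]/p//text()').extract()
            yield article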