Make transitions to the next page recursively

Question

I am trying to scrape this page using Scrapy. I can successfully scrape data from the page, but I also want to scrape data from the other pages (the ones reached through the "Next" link). This is the relevant part of my code:

def parse(self, response):
    item = TimemagItem()
    item['title']= response.xpath('//div[@class="text"]').extract()
    links = response.xpath('//h3/a').extract()
    crawledLinks=[]
    linkPattern = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&amp;%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&amp;%@!\-\/\(\)]+))?$")

    for link in links:
        if linkPattern.match(link) and not link in crawledLinks:
            crawledLinks.append(link)
        yield Request(link, self.parse)

    yield item

I am getting the correct information: the titles from the linked pages. The spider just isn't navigating to the next pages. How can I tell Scrapy to navigate?

python scrapy

user46257 31 oct. 14 at 19:19

1 answer

André Teixeira · Answer 1 · 2014-10-31T19:51:51+0000

Have a look at the Scrapy Link Extractors documentation. Link extractors are the intended way to tell your spider which links on a page to follow.

Looking at the page you want to crawl, I believe you need two extractor rules: one for the article links and one for the "Next" link. Here's an example of a simple spider with rules that match your TIME web pages:

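# Import paths for Scrapy 1.0 and later.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# TimemagItem is the item class from the question (defined in your project's items module).


class TIMESpider(CrawlSpider):
    name = "time_spider"
    allowed_domains = ["time.com"]
    start_urls = [
        'http://search.time.com/results.html?N=45&Ns=p_date_range|1&Ntt=&Nf=p_date_range%7cBTWN+19500101+19500130'
    ]

    rules = (
        # Article links on a results page: fetch each one and pass it to parse_item.
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="tout"]/h3/a',)),
             callback='parse_item'),
        # "Next" link: follow it so the rules are applied again to the next results page.
        Rule(LinkExtractor(restrict_xpaths=('//a[@title="Next"]',)),
             follow=True),
    )

    # Do not name this callback "parse": CrawlSpider uses parse() internally
    # to apply the rules, and overriding it stops the spider from following links.
    def parse_item(self, response):
        item = TimemagItem()
        item['title'] = response.xpath('.//title/text()').extract()
        return item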
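
To make the division of labour between the two rules easier to see, here is a rough sketch in plain Python with no Scrapy involved. It is not how Scrapy is implemented; the in-memory PAGES dictionary, the example.com URLs, and the PageParser and crawl names are all made up for illustration. Article links are fetched and handed to a "callback" (here, reading the title), while "Next" links are followed so that the same extraction runs again on the next results page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" (URL -> HTML) standing in for real HTTP responses.
PAGES = {
    "http://example.com/results?page=1": (
        '<div class="tout"><h3><a href="/article/a">A</a></h3></div>'
        '<a title="Next" href="/results?page=2">Next</a>'
    ),
    "http://example.com/results?page=2": (
        '<div class="tout"><h3><a href="/article/b">B</a></h3></div>'
    ),
    "http://example.com/article/a": "<title>Article A</title>",
    "http://example.com/article/b": "<title>Article B</title>",
}


class PageParser(HTMLParser):
    """Collects article links (<a> inside <h3>), "Next" links, and the <title> text."""

    def __init__(self):
        super().__init__()
        self.article_links = []
        self.next_links = []
        self.title = ""
        self._in_h3 = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_h3 = True
        elif tag == "title":
            self._in_title = True
        elif tag == "a" and "href" in attrs:
            if attrs.get("title") == "Next":
                self.next_links.append(attrs["href"])
            elif self._in_h3:
                self.article_links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, pages):
    """Walk results pages via "Next" links and return the titles of linked articles."""
    titles, seen = [], set()
    queue = [(start_url, False)]  # (url, is_article)
    while queue:
        url, is_article = queue.pop(0)
        if url in seen or url not in pages:
            continue
        seen.add(url)
        parser = PageParser()
        parser.feed(pages[url])
        if is_article:
            # Counterpart of the first rule's callback: extract data, follow nothing.
            titles.append(parser.title)
            continue
        # Counterpart of the two rules applied to a results page.
        queue.extend((urljoin(url, href), True) for href in parser.article_links)
        queue.extend((urljoin(url, href), False) for href in parser.next_links)
    return titles
```

Calling crawl("http://example.com/results?page=1", PAGES) visits both results pages and returns the two article titles. In a real spider, Scrapy's scheduler and duplicate filter take the place of the queue and the seen set.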
