Python Scrapy: go back from a child page to continue scraping
My spider function is on a page, and I need to follow a link and get some data from that child page to add to my item. However, I need to visit several different child pages from the parent page without creating any additional items. How would I go about doing this? From what I can read in the documentation, I can only go linearly:

parent page > next page > next page

But I need:

parent page > next page
            > next page
            > next page
You must return Request instances and pass the item along in meta. And you will have to do it in a linear fashion, building a chain of requests and callbacks. To achieve this, you can pass along a list of item-filling requests and return the item from the last callback:
def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url

    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    request1 = scrapy.Request(url1, callback=self.parse_page1)

    url2 = response.xpath('//a[@class="link2"]/@href').extract()[0]
    request2 = scrapy.Request(url2, callback=self.parse_page2)

    url3 = response.xpath('//a[@class="link3"]/@href').extract()[0]
    request3 = scrapy.Request(url3, callback=self.parse_page3)

    # Attach the item and the queue of remaining requests to the first request
    request1.meta['item'] = item
    request1.meta['requests'] = [request2, request3]
    return request1

def parse_page1(self, response):
    item = response.meta['item']
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]
    # Pop the next request off the queue and hand the item
    # (and the rest of the queue) along to it
    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page2(self, response):
    item = response.meta['item']
    item['data2'] = response.xpath('//div[@class="data2"]/text()').extract()[0]
    # Same hand-off for the last request in the queue
    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    return next_request

def parse_page3(self, response):
    item = response.meta['item']
    item['data3'] = response.xpath('//div[@class="data3"]/text()').extract()[0]
    # Last page in the chain: the item is complete, so return it
    return item
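For completeness, here is a minimal sketch of the MyItem definition the code above assumes (the field names are taken from the callbacks; adjust them to your data):

import scrapy

class MyItem(scrapy.Item):
    # One field per value filled in along the request chain
    main_url = scrapy.Field()
    data1 = scrapy.Field()
    data2 = scrapy.Field()
    data3 = scrapy.Field()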
Using Scrapy Requests, you can perform additional operations on the next URL inside the scrapy.Request callback.
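As a minimal, self-contained sketch of that idea (the spider name, URL, selectors, and field names here are placeholders, not from the original question):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/parent']  # placeholder URL

    def parse(self, response):
        item = {'main_url': response.url}
        # Placeholder selector for the link to the child page
        url = response.xpath('//a[@class="child"]/@href').extract()[0]
        # Follow the child link; the partially filled item rides along in meta
        request = scrapy.Request(response.urljoin(url), callback=self.parse_child)
        request.meta['item'] = item
        yield request

    def parse_child(self, response):
        item = response.meta['item']
        # Placeholder selector for the extra data on the child page
        item['child_data'] = response.xpath('//div[@class="data"]/text()').extract()[0]
        yield item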