Scrapy - sending a new request / using a callback

I'm going deeper than basic scrapers.

I understand the base class BaseSpider: name, allowed_domains, and how a Request object is sent for each start_url, with the parse function used as the callback that receives the response.

I know my parse function collects the XPath matches for every element whose class contains "service-name". I believe it then loops over them, storing each match's data in the "item" object, which is an instance of the "TgmItem" class defined in items.py.

'newUrl' contains the concatenated URL, which needs to be scraped further. I need to figure out how to get the LinkParse function to scrape every newUrl found, or how to pass it all the links to scrape.

I know meta is used to pass the item's data along with the request, and that callback names the function that will receive the response.

LinkParse will be used to scrape further data from each link, for example: "item['test'] = link.xpath('text()').extract()"

def parse(self, response):
    links = response.selector.xpath('//*[contains(@class, "service-name")]')
    for link in links:
        item = TgmItem()
        item['name'] = link.xpath('text()').extract()
        item['link'] = link.xpath('@href').extract()
        item['newUrl'] = response.url.join(item['link'])
        yield Request(newUrl, meta={'item':item}, callback=self.LinkParse)

def LinkParse(self, response):
    links = response.selector.xpath('*')
    for link in links:
        item = response.request.meta['item']
        item['test'] = link.xpath('text()').extract()
        yield item

      

I know that in the callback function you parse the response (the web page), which should be the page behind each link. But I think that to solve this problem I need to send the current response.url and process each or all of the links in the LinkParse function.

I am getting an error saying that newUrl is not defined; I assume Request cannot accept it.

I am not expecting a full solution here. Could someone point me in the right direction or suggest something for further research?

1 answer


The variable newUrl is undefined. Use item['newUrl'] instead:

yield Request(item['newUrl'], meta={'item': item}, callback=self.LinkParse)
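The reason for the error can be reproduced without Scrapy. The following is only a minimal sketch, using a plain dict in place of TgmItem and made-up example URLs: assigning to a key of the item does not create a variable with that name, so the bare name newUrl has nothing to refer to.

```python
# A plain dict stands in for TgmItem here (an assumption for illustration).
item = {}
item["newUrl"] = "http://example.com" + "/services/hosting"  # stored under a key

try:
    newUrl  # no variable named newUrl was ever assigned
    lookup_failed = False
except NameError:
    lookup_failed = True

# The value has to be read back through the item itself.
url_for_request = item["newUrl"]
print(lookup_failed, url_for_request)
```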

      

Also, the call response.url.join() doesn't make sense to me. If you want to combine response.url with an href attribute, use urljoin():

# extract() returns a list, so join a single href (here the first match)
item['newUrl'] = urlparse.urljoin(response.url, item['link'][0])
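To see the difference between the two calls, here is a small standalone sketch. The page URL and hrefs are invented for illustration, and it uses the Python 3 location of the function (urllib.parse.urljoin) rather than the Python 2 urlparse module. Since response.url is a string, response.url.join(...) is just str.join, which uses the URL as a separator between the list elements.

```python
from urllib.parse import urljoin

# Hypothetical values: the page being parsed and the hrefs found on it.
page_url = "http://example.com/services/index.html"
hrefs = ["/services/hosting", "details.html"]

# str.join puts page_url *between* the hrefs, which is not a usable URL.
wrong = page_url.join(hrefs)

# urljoin resolves each href against the page URL, one href at a time.
right = [urljoin(page_url, href) for href in hrefs]

print(wrong)
print(right)
```

Recent Scrapy versions also provide response.urljoin(href), which resolves the href against the response's own URL.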

      



Also, I'm not sure what you are trying to do in the LinkParse callback. As I understand it, you want to follow the service-name links and get additional data for each link. In that case I don't understand why you need the "for link in links" loop in the LinkParse() method.

From what I understand, your LinkParse() method should look like this:

def LinkParse(self, response):
    newfield = response.selector.xpath('//myfield/text()').extract()
    item = response.meta['item']
    item['newfield'] = newfield  
    return item

      
