Scrapy - sending a new request / using a callback
I am trying to go deeper than basic scrapers.
I understand the base class BaseSpider: name, allowed_domains, and how Request objects are sent for each start_url, with the parse function used as the callback that receives the response.
I know my parse function stores the XPath matches for every element with the class "service-name". I believe it then loops over them, storing each match's data in the "item" object, which is an instance of the "TgmItem" class defined in items.py.
'newUrl' contains the concatenated URL that needs to be scraped further. I need to figure out how to get the LinkParse function to scrape each newUrl found, or to collect all the links and scrape them.
I know meta is used to pass the item object's data along with the request, and callback tells the request which function should receive the response.
LinkParse will be used to scrape more data from every link, for example: item['test'] = link.xpath('text()').extract()
    def parse(self, response):
        links = response.selector.xpath('//*[contains(@class, "service-name")]')
        for link in links:
            item = TgmItem()
            item['name'] = link.xpath('text()').extract()
            item['link'] = link.xpath('@href').extract()
            item['newUrl'] = response.url.join(item['link'])
            yield Request(newUrl, meta={'item':item}, callback=self.LinkParse)

    def LinkParse(self, response):
        links = response.selector.xpath('*')
        for link in links:
            item = response.request.meta['item']
            item['test'] = link.xpath('text()').extract()
            yield item
I know that in the callback function you parse the response (the web page), which should be each followed link in turn, but I think that to solve this problem I need to send the current response.url and process every link in the LinkParse function.
I am getting an error saying that newUrl is not defined; I assume the Request cannot accept it.
I am not expecting a full solution here. Could someone point me in the right direction, or suggest something for further research?
The variable newUrl is undefined. Use item['newUrl'] instead:

    yield Request(item['newUrl'], meta={'item': item}, callback=self.LinkParse)
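To illustrate why the error appears before Scrapy is even involved, here is a minimal plain-Python sketch. The dictionary stands in for the TgmItem instance and the URL is a made-up placeholder; the point is only that assigning to a key does not create a variable with the same name.

```python
# A plain dict stands in for the TgmItem instance (assumption for illustration).
item = {}
item["newUrl"] = "http://example.com/a"  # stores a value under the key "newUrl"

try:
    newUrl  # no variable called newUrl was ever assigned
except NameError as exc:
    error = str(exc)  # Python raises this itself; Request is never reached

url = item["newUrl"]  # the value has to be read back through the item
print(error)
print(url)
```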
Also, the call response.url.join() doesn't make sense to me. If you want to combine response.url with the href attribute, use urljoin(). Note that extract() returns a list, so pass it a single href string:

    item['newUrl'] = urlparse.urljoin(response.url, item['link'][0])
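As a rough sketch of the difference, the snippet below compares str.join with urljoin using only the standard library. The base URL and hrefs are invented placeholders, and the import is the Python 3 location of the function (in Python 2 it lives in the urlparse module, as in the line above).

```python
from urllib.parse import urljoin  # Python 3; Python 2 used urlparse.urljoin

base = "http://example.com/services/index.html"  # hypothetical page URL
hrefs = ["/plans/basic", "detail.html", "http://other.example.org/x"]

# urljoin resolves each href against the page URL, like a browser would.
joined = [urljoin(base, href) for href in hrefs]

# str.join is unrelated: it uses the string as a separator between the
# elements of a list. With a one-element list the separator is never used,
# so the href comes back unchanged and the base URL is lost.
single = base.join(["/plans/basic"])
pair = base.join(["a", "b"])

for href, url in zip(hrefs, joined):
    print(href, "->", url)
print(single)
print(pair)
```

This is why response.url.join(item['link']) tends to produce a relative path rather than a full URL when the XPath matches a single href.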
Also, I'm not sure what you are trying to do in the LinkParse callback. As I understand it, you want to follow the service-name links and get additional data for each link. In that case I don't understand why you need the for link in links loop in the LinkParse() method.
From what I understand, your LinkParse() method should look like this:
    def LinkParse(self, response):
        newfield = response.selector.xpath('//myfield/text()').extract()
        item = response.meta['item']
        item['newfield'] = newfield
        return item
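To see why no loop is needed in the callback, here is a small simulation of the hand-off in plain Python. FakeRequest, FakeResponse, crawl, and the example URLs are hypothetical stand-ins written for this sketch, not Scrapy classes; they only mimic the idea that each request carries its own item in meta and that the callback runs once per followed link.

```python
# Plain-Python stand-ins (NOT Scrapy classes) that mimic how an item
# travels from parse() to a callback through the request's meta dict.

class FakeRequest:
    def __init__(self, url, meta, callback):
        self.url = url
        self.meta = meta
        self.callback = callback


class FakeResponse:
    def __init__(self, request, body):
        self.url = request.url
        self.meta = request.meta  # the response exposes the request's meta
        self.body = body


def parse(links):
    # One fresh item and one request per link found on the listing page.
    for name, url in links:
        item = {"name": name, "newUrl": url}
        yield FakeRequest(url, meta={"item": item}, callback=link_parse)


def link_parse(response):
    # No loop: this callback is invoked once for each followed link.
    item = response.meta["item"]
    item["newfield"] = response.body.upper()  # placeholder for an XPath extract
    return item


def crawl(links, pages):
    # Tiny scheduler: "download" each request, then call its callback.
    results = []
    for request in parse(links):
        response = FakeResponse(request, pages[request.url])
        results.append(request.callback(response))
    return results


links = [("A", "http://example.com/a"), ("B", "http://example.com/b")]
pages = {"http://example.com/a": "alpha", "http://example.com/b": "beta"}
items = crawl(links, pages)
print(items)
```

Each item ends up with the fields set in parse() plus the one added in the callback, which is the same shape the LinkParse() method above returns.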