SOLVED with updated code: Scrapy cannot scrape second page with ItemLoader

Update: 7/29 at 22:06. Issue resolved with updated code.

Update: 7/29, 9:29 PM: After reading this post, I updated my code.

UPDATE: 7/28/15 at 7:35 pm: following Martin's suggestion, the post has been changed, but there is still no item listing or database entry.

ORIGINAL: I can successfully scrape one page (the base page). Now I am trying to scrape one more field for each item from another URL found on the base page, using a Request with a callback. But it doesn't work. The spider is here:

from scrapy import Request, Spider
from scrapy.selector import Selector
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem 
from CAPjobs.items import CAPjobsItemLoader

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
    "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        location = response.xpath('//div[@id="extranav"]//ul[@class="job-addresses"]/li/text()').extract()
        il.add_value('loc_pj', location)  
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')    

        for site in sites:

            il = CAPjobsItemLoader(CAPjobsItem(), selector = site) 
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)

      

The scraper is now fully functional. :)
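The spider above builds the absolute job URL inside XPath with concat(). As a side note, the same step can be done in Python with the standard library's urljoin. This is only a sketch: the page URL and href below are hypothetical examples, and absolute_url is a helper name made up for illustration, not part of the spider.

```python
from urllib.parse import urljoin

# Assumption: both values are hypothetical examples, not taken from the real site.
page_url = "http://www.nature.com/naturejobs/science/jobs?q=pathologist"
href = "/naturejobs/science/jobs/12345"


def absolute_url(base, link):
    """Resolve a possibly relative link against the URL of the page it came from."""
    return urljoin(base, link)


job_url = absolute_url(page_url, href)
print(job_url)
```

Unlike a hard-coded concat(), urljoin leaves a link that is already absolute unchanged.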



1 answer


You initialize the ItemLoader like this:

il = CAPjobsItemLoader(CAPjobsItem, sites)

      

The documentation does it like this:



l = ItemLoader(item=Product(), response=response)

      

So I think you are missing the parentheses on CAPjobsItem, and your line should read:

il = CAPjobsItemLoader(CAPjobsItem(), sites)
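
To illustrate why the parentheses matter, here is a minimal sketch in plain Python. JobItem and MiniLoader are hypothetical stand-ins, not Scrapy's classes, so the real ItemLoader may fail differently or at a different point. The idea is the same, though: a loader needs an item instance to write fields into, and a bare class cannot take item assignment.

```python
class JobItem(dict):
    """Hypothetical stand-in for CAPjobsItem: a dict-like item."""


class MiniLoader:
    """Hypothetical stand-in for an item loader; NOT Scrapy's ItemLoader."""

    def __init__(self, item):
        self.item = item
        self._values = {}

    def add_value(self, field, value):
        # Collect values per field, as a list.
        self._values.setdefault(field, []).append(value)

    def load_item(self):
        # Write the collected values into the item that was passed in.
        for field, values in self._values.items():
            self.item[field] = values
        return self.item


def try_load(item):
    loader = MiniLoader(item)
    loader.add_value("title", "Pathologist")
    try:
        return loader.load_item()
    except TypeError as exc:
        return "TypeError: %s" % exc


ok = try_load(JobItem())   # instance: fields can be assigned
bad = try_load(JobItem)    # class object: item assignment raises TypeError
print(ok)
print(bad)
```

In this sketch the instance comes back populated, while passing the class only produces an error message.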

      
