How to fill scrapy.Field like a dictionary

I am creating a scraper for www.apkmirror.com using Scrapy (with a SitemapSpider). So far the following works:

DEBUG = True

from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem


class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        item = ApkmirrorScraperItem()
        item['url'] = response.url
        item['title'] = response.xpath('//h1[@title]/text()').extract_first()
        item['developer'] = response.xpath('//h3[@title]/a/text()').extract_first()
        return item

      

where ApkmirrorScraperItem is defined in items.py as follows:

import scrapy


class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()

      

The resulting JSON output if I run it from the project directory using the command

scrapy crawl apkmirror-spider -o data.json

      

is an array of JSON dictionaries with keys url, title and developer, and the corresponding strings as values. However, I would like to change this so that the value of developer is itself a dictionary with a field name, so I can fill it like this:

item['developer']['name'] = response.xpath('//h3[@title]/a/text()').extract_first()

      

However, if I try this, I get KeyErrors, even if I initialize the developer Field (which is a dict according to https://doc.scrapy.org/en/latest/topics/items.html#item-fields) like developer = scrapy.Field(name=None). How can I do this?


1 answer


Scrapy implements fields internally as dicts, but that doesn't mean they should be accessed as dicts. When you call item['developer'], what you really get is the value of the field, not the field itself. So, if the value hasn't been set yet, this will raise a KeyError.
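The distinction between a field and its value can be sketched with a small stand-in class. This is a simplified model written for illustration, not Scrapy's actual code: the name SketchItem and the sample developer name are assumptions. It keeps the declared field metadata and the assigned values in two separate dicts, which is why metadata such as name=None never becomes a readable value.

```python
class SketchItem:
    """Simplified stand-in for an item class; NOT Scrapy's real implementation."""

    # Declared fields: field name -> metadata dict (the role scrapy.Field plays).
    fields = {"url": {}, "title": {}, "developer": {"name": None}}

    def __init__(self):
        self._values = {}  # values that have actually been assigned

    def __getitem__(self, key):
        # Reads come from the assigned values, never from the field metadata.
        return self._values[key]

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        self._values[key] = value


item = SketchItem()

try:
    # The lookup item["developer"] runs first and fails: no value is stored yet.
    item["developer"]["name"] = "Example Dev"
    failed = False
except KeyError:
    failed = True

# Once a dict has been assigned as the value, nested assignment works.
item["developer"] = {}
item["developer"]["name"] = "Example Dev"
```

In this model, the metadata passed when declaring the field stays in fields, so it has no effect on what item['developer'] returns.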

With this in mind, there are two ways to solve your problem.

First, just set the developer field value to a dict:

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    item['developer'] = {'name': response.xpath('//h3[@title]/a/text()').extract_first()}
    return item
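To get a feel for the shape this should give in data.json, here is a minimal sketch that serializes a plain dict standing in for the item. The URL, title and developer name are placeholder values made up for illustration, not real scraped data, and the sketch uses the standard json module rather than Scrapy's own exporter.

```python
import json

# Placeholder values for illustration only; not real scraped data.
item = {
    "url": "https://example.com/some-app-android-apk-download/",
    "title": "Example App 1.0",
    "developer": {"name": "Example Developer"},
}

# A nested dict serializes as a nested JSON object.
line = json.dumps(item, sort_keys=True)
print(line)
```

The developer key now maps to an object with a name key instead of a bare string.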

      



Second, create a new Developer item class and set the developer value to an instance of that class:

# this can go in items.py
import scrapy

class Developer(scrapy.Item):
    name = scrapy.Field()

# in the spider (import Developer from apkmirror_scraper.items)
def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()

    # fill the nested item first, then attach it to the parent item
    dev = Developer()
    dev['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
    item['developer'] = dev

    return item

      

Hope this helps :)
