How to fill scrapy.Field like a dictionary
I am creating a scraper for www.apkmirror.com using Scrapy (with SitemapSpider spider). So far the following works:
DEBUG = True

from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem

class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        item = ApkmirrorScraperItem()
        item['url'] = response.url
        item['title'] = response.xpath('//h1[@title]/text()').extract_first()
        item['developer'] = response.xpath('//h3[@title]/a/text()').extract_first()
        return item
where ApkmirrorScraperItem is defined in items.py as follows:
import scrapy

class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()
The resulting JSON output if I run it from the project directory using the command
scrapy crawl apkmirror-spider -o data.json
is an array of JSON dictionaries with keys url, title and developer, and the corresponding strings as values. However, I would like to change this so that the value of developer is itself a dictionary with a field name, so I can fill it like this:
item['developer']['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
However, if I try this, I get KeyErrors, even if I initialize the developer Field (which is a dict according to https://doc.scrapy.org/en/latest/topics/items.html#item-fields ) like developer = scrapy.Field(name=None). How can I do this?
Scrapy implements fields internally as dicts, but that doesn't mean they should be accessed as dicts. When you call item['developer'], what you are really doing is getting the value of the field, not the field itself. So, if the value hasn't been set yet, this will raise a KeyError.
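The difference between a field's metadata and its value can be illustrated with a toy model. This is only a sketch: Field and ToyItem below are simplified stand-ins written for this example, not Scrapy's actual source, and they assume that an item keeps declared fields and scraped values in two separate mappings.

```python
# Toy stand-ins for scrapy.Field / scrapy.Item; NOT Scrapy's real code.
class Field(dict):
    """Holds per-field metadata only, e.g. Field(name=None)."""


class ToyItem:
    # Declared fields: maps field name -> Field metadata.
    fields = {'developer': Field(name=None)}

    def __init__(self):
        # Scraped values live in a separate mapping that starts empty.
        self._values = {}

    def __getitem__(self, key):
        # Looks up the stored value, never the Field metadata.
        return self._values[key]

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        self._values[key] = value


item = ToyItem()

# No value has been stored yet, so the lookup fails before ['name'] is reached.
try:
    item['developer']['name'] = 'x'
    raised = False
except KeyError:
    raised = True

# Once a dict is stored as the value, nested assignment works.
item['developer'] = {}
item['developer']['name'] = 'Example Developer'
```

In this model, Field(name=None) only adds a name entry to the metadata in ToyItem.fields; it never creates a value that item['developer'] could return.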
With this in mind, there are two ways to solve your problem.
First, just set the developer field value to a dict:
def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    item['developer'] = {'name': response.xpath('//h3[@title]/a/text()').extract_first()}
    return item
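As a rough illustration of the shape each entry in data.json should then take, here is a sketch using only the standard json module. The string values are placeholders rather than real scraped data, and it assumes the feed exporter writes a nested dict as an ordinary JSON object.

```python
import json

# Placeholder values for illustration only; real values come from the page.
record = {
    'url': '<page url>',
    'title': '<app title>',
    'developer': {'name': '<developer name>'},
}

# The nested dict becomes a nested JSON object.
encoded = json.dumps(record)

# Reading it back gives the same nested structure.
decoded = json.loads(encoded)
print(decoded['developer']['name'])
```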
Second, create a new Developer item class and set the developer value to an instance of that class:
# this can go to items.py
class Developer(scrapy.Item):
    name = scrapy.Field()

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    dev = Developer()
    dev['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
    item['developer'] = dev
    return item
Hope this helps :)