How to extract src of dynamically loaded images using scrapy

Question

I'm currently trying to scrape the https://www.bloomingdales.com site using scrapy.

In this project, I am trying to extract the url of the main image loaded on each of the product pages, for example:

https://www.bloomingdales.com/shop/product/free-people-over-the-rainbow-beanie?ID=1791385&CategoryID=1006048#fn=ppp%3D%26spp%3D1%26sp%3D1%26rid%3D83%26spc%3D94%26rsid%3Dundefined%26pn%3D1|2|1|94

However, each image is loaded by a separate image request made by the page, so I can't simply select the image URL with XPath. How can I extract the image URLs using scrapy?

[Screenshot: the image requests seen in Chrome developer tools]

python scrapy

taphos Apr 12 17 at 8:04

1 answer

Granitosaurus · Accepted Answer · 2017-04-12T08:29:56+0000

It is common for e-commerce websites to embed some JSON data in the HTML body and have the user's browser unpack it into a full page.

For this particular page, if you copy the image url and search the page source, you can see all the product data stored in:

<script id="pdp_data" type="application/json">some_json</script>

You can grab this data with scrapy and decode the json into a python dictionary:

import json

# the product data is embedded as JSON text inside the <script> tag
raw = response.xpath("//script[@id='pdp_data']/text()").extract_first()
data = json.loads(raw)
# then you can parse the data
data['product']['imageSource']
# '8/optimized/9216988_fpx.tif'
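
One thing the snippet above does not cover is a page where the `pdp_data` script tag is missing or malformed: `extract_first()` then returns `None` and `json.loads` raises. The following is a minimal sketch of a defensive helper, not part of the original answer. The function name `image_source_from_pdp` is hypothetical, and the `product` / `imageSource` keys are assumed to match what the answer observed on this one page.

```python
import json


def image_source_from_pdp(raw):
    """Return product.imageSource from the pdp_data JSON text, or None.

    `raw` is whatever extract_first() returned: a string, or None when
    the <script id="pdp_data"> tag is absent from the response.
    (Hypothetical helper; the key names are assumed from the answer.)
    """
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict):
        return None
    product = data.get("product")
    if not isinstance(product, dict):
        return None
    return product.get("imageSource")


# Inside a spider callback this might be used as:
#     raw = response.xpath("//script[@id='pdp_data']/text()").extract_first()
#     yield {"image_source": image_source_from_pdp(raw)}

sample = '{"product": {"imageSource": "8/optimized/9216988_fpx.tif"}}'
print(image_source_from_pdp(sample))  # 8/optimized/9216988_fpx.tif
print(image_source_from_pdp(None))    # None
```

Returning `None` instead of raising lets the spider keep crawling when a single product page has a different layout; whether that is preferable to failing loudly depends on the project.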
