How to extract src of dynamically loaded images using scrapy
I'm currently trying to scrape the https://www.bloomingdales.com site using scrapy.
In this project, I am trying to extract the url of the main image loaded on each of the product pages, for example:
However, each image is loaded by a separate image request on the page, so I can't just find the image URL with an XPath. How can I extract the image URLs using scrapy?
Here is a screenshot of the requests I see in my Chrome developer tools:
E-commerce websites commonly embed some JSON data in the HTML body, and the user's browser then unpacks it into the full page.
For this particular page, if you copy the image url and search the page source, you can see all the product data stored in:
<script id="pdp_data" type="application/json">some_json</script>
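If you are not sure which script tag holds the data, you can list every JSON script block in a saved copy of the page. This is only a rough sketch using the standard library; the `JsonScriptFinder` class, the `find_json_scripts` helper and the `SAMPLE` page are made up for illustration, and the sample contains nothing beyond the `pdp_data` tag and value shown in this answer.

```python
from html.parser import HTMLParser
import json


class JsonScriptFinder(HTMLParser):
    """Collect the contents of <script type="application/json"> tags, keyed by id."""

    def __init__(self):
        super().__init__()
        self.blocks = {}          # script id -> decoded JSON
        self._current_id = None   # id of the JSON script being read, if any
        self._chunks = []         # text pieces of the current script body

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/json":
            self._current_id = attrs.get("id", "")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that belongs to a JSON script tag.
        if self._current_id is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._current_id is not None:
            self.blocks[self._current_id] = json.loads("".join(self._chunks))
            self._current_id = None


def find_json_scripts(html):
    """Return a dict mapping script ids to their decoded JSON content."""
    finder = JsonScriptFinder()
    finder.feed(html)
    finder.close()
    return finder.blocks


# Hypothetical, heavily shortened stand-in for a product page.
SAMPLE = (
    '<html><body>'
    '<script id="pdp_data" type="application/json">'
    '{"product": {"imageSource": "8/optimized/9216988_fpx.tif"}}'
    '</script>'
    '<script>var x = 1;</script>'
    '</body></html>'
)

blocks = find_json_scripts(SAMPLE)
print(list(blocks))
```

Ordinary JavaScript script tags are skipped, so only the ids of JSON blocks come back.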
You can grab this data with scrapy and decode the json into a python dictionary:
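import json

# grab the text of the <script id="pdp_data"> tag (a JSON string)
data = response.xpath("//script[@id='pdp_data']/text()").extract_first()
# decode the JSON string into a python dictionary
data = json.loads(data)
# then you can parse the data
data['product']['imageSource']
# '8/optimized/9216988_fpx.tif'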
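In a real spider the script tag may be missing or the JSON may be shaped differently on some pages, so it can help to wrap the lookup in a small helper. The following is a minimal sketch, not a definitive implementation; `extract_image_source` is a hypothetical name, and the spider usage in the trailing comments assumes a standard scrapy `parse` callback.

```python
import json


def extract_image_source(raw_json):
    """Return product.imageSource from the pdp_data text, or None if unavailable."""
    if not raw_json:
        # The script tag was not found on this page.
        return None
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        # The tag was there but did not contain valid JSON.
        return None
    product = data.get("product") if isinstance(data, dict) else None
    if not isinstance(product, dict):
        return None
    return product.get("imageSource")


# Assumed usage inside a spider callback:
#
# def parse(self, response):
#     raw = response.xpath("//script[@id='pdp_data']/text()").extract_first()
#     yield {"image_source": extract_image_source(raw)}
```

Note that the value is a relative path, so you still need to work out the image host from the requests shown in the developer tools before you can download the file.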