Taking screenshots with Scrapy Splash?
I am trying to scrape a site while taking a screenshot of each page. So far, I've managed to put together the following code:
import json
import base64
import scrapy
from scrapy_splash import SplashRequest

class ExtractSpider(scrapy.Spider):
    name = 'extract'

    def start_requests(self):
        url = 'https://stackoverflow.com/'
        splash_args = {
            'html': 1,
            'png': 1
        }
        yield SplashRequest(url, self.parse_result, endpoint='render.json', args=splash_args)

    def parse_result(self, response):
        png_bytes = base64.b64decode(response.data['png'])
        imgdata = base64.b64decode(png_bytes)
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
The request itself works fine (against stackoverflow.com, for example) and png_bytes contains data, but the file that gets written is a broken image that will not open.
Is there a way to fix this, or is there a better approach? I read that Splash Lua scripts can do this, but I couldn't work out how to implement one. Thanks.
1 answer
You are decoding from base64 twice:
png_bytes = base64.b64decode(response.data['png'])
imgdata = base64.b64decode(png_bytes)
Just decode once:
def parse_result(self, response):
    imgdata = base64.b64decode(response.data['png'])
    filename = 'some_image.png'
    with open(filename, 'wb') as f:
        f.write(imgdata)
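If you want to catch this kind of mistake early, one option is to check that the decoded bytes start with the PNG file signature before writing them. The following is only a minimal sketch: decode_png and save_png are hypothetical helper names, not part of scrapy_splash, and it assumes the 'png' field is a base64 string, as in the code above.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(b64_text):
    """Decode a base64 string exactly once and check that it looks like a PNG."""
    raw = base64.b64decode(b64_text)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw


def save_png(b64_text, filename):
    """Hypothetical helper: decode once, then write the raw bytes in binary mode."""
    raw = decode_png(b64_text)
    with open(filename, "wb") as f:
        f.write(raw)
    return len(raw)
```

Inside parse_result this could be called as save_png(response.data['png'], 'some_image.png'). Data that has been decoded a second time no longer starts with the signature, so the check would raise instead of silently writing a broken file.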