Persistent scraping cache

We need to be able to re-crawl historical data. Imagine that today is June 23rd. We crawl the website today, but a few days later we realize we need to re-scan it, "seeing" it exactly the way we did on the 23rd. This means that all redirects, GET and POST requests, etc. should be replayed: every page the spider sees should be exactly the same as on the 23rd, no matter what.

Use case: if the website changes and our spider fails to crawl something, we want to be able to go back in time and run the spider again after we fix it.

In principle this should be pretty simple: subclass the standard Scrapy cache storage, make it use dates for subfolders, and end up with something like this:

cache/spider_name/2015-06-23/HERE ARE THE CACHED DIRS
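
A minimal sketch of such a storage subclass, assuming Scrapy's FilesystemCacheStorage and its private _get_request_path() helper (internals can differ between Scrapy versions), with a hypothetical CACHE_DATE setting for pointing the spider back at an earlier day's folder:

import os
from datetime import date

from scrapy.extensions.httpcache import FilesystemCacheStorage


class DatedFilesystemCacheStorage(FilesystemCacheStorage):
    """Filesystem cache storage that partitions cached pages by date."""

    def __init__(self, settings):
        super().__init__(settings)
        # Default to today's date; set the (hypothetical) CACHE_DATE setting,
        # e.g. '2015-06-23', to replay the cache from an earlier crawl.
        self.cache_date = settings.get('CACHE_DATE', date.today().isoformat())

    def _get_request_path(self, spider, request):
        # Insert the date between the spider name and the per-request folder,
        # so paths look like cache/spider_name/2015-06-23/...
        default_path = super()._get_request_path(spider, request)
        spider_dir = os.path.join(self.cachedir, spider.name)
        relative = os.path.relpath(default_path, spider_dir)
        return os.path.join(spider_dir, self.cache_date, relative)

It would be enabled by pointing HTTPCACHE_STORAGE at the class, e.g. HTTPCACHE_STORAGE = 'myproject.cache.DatedFilesystemCacheStorage' (the module path here is made up).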

but when I experimented with this, I sometimes noticed that the spider was crawling the live site. Some pages were apparently treated as missing from the cache (even though the corresponding files exist on disk), so they were fetched from the live website instead. This happened with CAPTCHA pages in particular, and possibly some others.

How can we make Scrapy always fetch pages from the cache and never hit the website at all? Ideally, this should work even without an internet connection.

Update: we are already using the Dummy cache policy and HTTPCACHE_EXPIRATION_SECS = 0.
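
For reference, the cache-related settings we currently have look roughly like this (a sketch; module paths assume Scrapy 1.x, everything not shown is left at its default):

HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_EXPIRATION_SECS = 0  # cached entries never expire
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'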

Thanks!

1 answer


To do what you want, you must have this in your settings:

HTTPCACHE_IGNORE_MISSING = True

With this enabled, requests not found in the cache are ignored instead of being downloaded from the live site.



When you set HTTPCACHE_EXPIRATION_SECS = 0, it only assures you that cached requests never expire; if a page is not in your cache, it will still be downloaded.

You can check the documentation.
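
Putting it all together, a settings.py sketch for fully offline replay (module paths assume Scrapy 1.x; swap the storage class for a date-partitioned subclass like the one sketched in the question if you need per-day folders):

HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_EXPIRATION_SECS = 0    # cached entries never expire
HTTPCACHE_IGNORE_MISSING = True  # never fall back to the live site
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'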
