Spider-based Scrapy DupeFilter?

I currently have a project with quite a lot of spiders, and about half of them need a special rule to filter out duplicate requests. This is why I extended the RFPDupeFilter class with custom rules for each spider that needs it.

My custom dupe filter checks whether the request URL comes from a site that needs special filtering and cleans it up (removes query parameters, shortens paths, extracts the unique parts, etc.) so that the fingerprint is the same for all equivalent pages. So far so good; however, at the moment I have a single function with about 60 if/elif statements that every request goes through. This is not only suboptimal, but also difficult to maintain.
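
For illustration, a stripped-down sketch of what this looks like (the domains and rules below are just placeholders, and this assumes the older Scrapy API where RFPDupeFilter's request_fingerprint() method can be overridden):

from urllib.parse import urlparse, urlunparse

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class SiteAwareDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # fingerprint the cleaned url so equivalent pages collide
        return request_fingerprint(request.replace(url=self._clean(request.url)))

    def _clean(self, url):
        parts = urlparse(url)
        if parts.netloc == 'shop.example.com':        # made-up rule
            parts = parts._replace(query='')          # drop the query string
        elif parts.netloc == 'news.example.com':      # made-up rule
            parts = parts._replace(path=parts.path.rsplit('/', 1)[0])
        # ... roughly 60 more elif branches like these ...
        return urlunparse(parts)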

So here's the question. Is there a way to define the URL-cleaning rule inside the spider itself? The ideal approach for me would be to extend the Spider class and define a clean_url method that by default just returns the request URL, and override it in the spiders that need special handling. I have looked into it, but I cannot find a way to access the current spider's methods from the dupe filter class.
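
In other words, something along these lines (the names are illustrative only):

import scrapy


class UrlCleaningSpider(scrapy.Spider):
    # base spider: by default the url is returned untouched
    def clean_url(self, url):
        return url


class SomeShopSpider(UrlCleaningSpider):
    name = 'someshop'
    # ... start_urls, parse(), etc. ...

    def clean_url(self, url):
        # site-specific rule, e.g. drop the query string
        return url.split('?')[0]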

Any help would be much appreciated!

1 answer


You can implement a downloader middleware.

middleware.py

from scrapy.exceptions import IgnoreRequest


class CleanUrl(object):
    # class attribute so the set of seen urls persists for the whole crawl
    seen_urls = set()

    def process_request(self, request, spider):
        url = spider.clean_url(request.url)
        if url != request.url:
            # reschedule the request with the cleaned url; it will come back
            # through this middleware once more and be deduplicated below
            return request.replace(url=url)
        if url in self.seen_urls:
            raise IgnoreRequest()
        self.seen_urls.add(url)

settings.py

DOWNLOADER_MIDDLEWARES = {'PROJECT_NAME_HERE.middleware.CleanUrl': 500}
# if you want to make sure this is the last middleware to execute, increase the 500 to 1000

You probably want to disable the dupefilter altogether if you've done it this way.
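
If you go that route, one way to switch the built-in deduplication off is to point DUPEFILTER_CLASS at the no-op BaseDupeFilter (a sketch, assuming the stock scrapy.dupefilters module):

settings.py

# BaseDupeFilter never marks a request as seen, so all deduplication
# is left to the CleanUrl middleware above
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'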
