Scrapy: preventative measures before starting a scrape

I am going to scrape about 50,000 records from a real estate site (using Scrapy). The programming is done and tested, and the database is properly designed.

But I want to be prepared for unexpected events. How do I get the scrape to actually run flawlessly, with minimal risk of getting blocked and of wasted time?

More specifically:

  • Should I do it in stages (scraping in smaller batches)?
  • What and how should I log?
  • What other points of attention should I consider before launching?


1 answer


First of all, get a basic understanding of how to be a good citizen when scraping web pages.


In general, first make sure that you are allowed to scrape this particular website and that you follow its terms of use. Also check the website's robots.txt and follow the rules listed there (for example, there might be a Crawl-delay directive set). It would also be a good idea to contact the website owner and let them know what you are going to do, or to ask for permission.

Identify yourself by explicitly setting the User-Agent header.
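
For example, in the project's settings.py you could set both at once; the bot name and contact address below are placeholders, use your own:

```python
# settings.py -- identify the crawler and respect robots.txt
# (bot name and contact address are placeholders)
USER_AGENT = "realestate-research-bot/1.0 (+mailto:you@example.com)"

# Download each site's robots.txt and obey its rules
ROBOTSTXT_OBEY = True
```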

Should I do it in stages (scraping in smaller batches)?

This is what DOWNLOAD_DELAY is for:

The time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP also matter.

Tune these settings to avoid hitting the website's servers too often.
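
A conservative starting point might look like the following; the actual values are illustrative and should be tuned to the site you are scraping:

```python
# settings.py -- conservative throttling (values are illustrative)
DOWNLOAD_DELAY = 2.0                 # wait ~2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay (0.5x-1.5x) between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
CONCURRENT_REQUESTS_PER_IP = 1       # if non-zero, this per-IP limit is used instead

# Optionally let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```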

What and how should I log?

The information that Scrapy puts on the console is quite extensive, but you may want to log any errors and exceptions raised during the crawl. I personally like the idea of listening for the spider_error signal, which is fired when a spider callback raises an exception.
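
As a minimal sketch, you could connect to that signal from an extension and write the failures to a dedicated logger; the module path and class name here are placeholders:

```python
# extensions.py -- log spider_error signals to a dedicated logger
# (module and class names are placeholders)
import logging

from scrapy import signals


class SpiderErrorLogger:
    def __init__(self):
        self.logger = logging.getLogger("spider_errors")

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # spider_error fires whenever a spider callback raises an exception
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        self.logger.error("Callback failed for %s: %s", response.url, failure.getTraceback())
```

Enable it through the EXTENSIONS setting, e.g. EXTENSIONS = {"myproject.extensions.SpiderErrorLogger": 500} (the path is again a placeholder).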

What other points of attention should I consider before launching?

You still have a lot to think about.

At some point, you may get banned. There is always a reason for this; the most obvious one is that you are still crawling too hard and they don't like it. There are certain techniques and tricks to avoid being banned, such as rotating IP addresses, using proxies, or running the scraper in the cloud.
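
One common building block is a downloader middleware that sets the proxy meta key on each request, which Scrapy's built-in HttpProxyMiddleware then honours. A minimal sketch, with placeholder proxy URLs:

```python
# middlewares.py -- rotate requests across a pool of proxies
# (the proxy URLs are placeholders; use a real, working proxy list)
import random


class RotatingProxyMiddleware:
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks up the 'proxy' meta key
        request.meta["proxy"] = random.choice(self.PROXIES)
```

You would then enable it via DOWNLOADER_MIDDLEWARES, e.g. DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 350}; the module path and priority are illustrative.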

Another thing to worry about is crawling speed and scaling; at some point you may need to distribute your crawling process. This is where scrapyd can help:
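
With a scrapyd server running (by default on port 6800) and your project deployed to it, you can schedule runs over its HTTP JSON API; the host, project and spider names below are placeholders for your own deployment:

```python
# schedule_crawl.py -- start a spider on a scrapyd server via its HTTP API
# (host, project and spider names are placeholders)
import requests

response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "realestate", "spider": "listings"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```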

However, make sure you don't cross the line and stay on the legal side.
