Scrapy: preventative measures before starting a scrape

I am going to scrape about 50,000 records from a real estate site (using Scrapy). The programming is done and tested, and the database is properly designed.

But I want to be prepared for unexpected events. How do I get the scrape to actually run flawlessly, with minimal risk of getting blocked and of wasted time?

More specifically:

  • Should I do it in stages (scraping in smaller batches)?
  • What and how should I log?
  • What other points of attention should I consider before launching?


1 answer


First of all, get a basic understanding of how to be a good citizen when scraping web pages.


In general, first make sure that you are allowed to scrape this particular website and that you follow its terms of use. Also check the website's robots.txt and follow the rules listed there (for example, there might be a Crawl-delay directive set). It would also be a good idea to contact the website owner and let them know what you are going to do, or to ask for permission.

Identify yourself by explicitly setting the User-Agent header.
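
For example, in the project's settings.py you could set both at once; the bot name and contact address below are placeholders, use your own:

```python
# settings.py -- identify the crawler and respect robots.txt
# (bot name and contact address are placeholders)
USER_AGENT = "realestate-research-bot/1.0 (+mailto:you@example.com)"

# Download each site's robots.txt and obey its rules
ROBOTSTXT_OBEY = True
```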

Should I do it in stages (scraping in smaller batches)?

This is what DOWNLOAD_DELAY is for:

The time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP also matter.

Tune these settings to avoid hitting the website's servers too often.
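
A conservative starting point might look like the following; the actual values are illustrative and should be tuned to the site you are scraping:

```python
# settings.py -- conservative throttling (values are illustrative)
DOWNLOAD_DELAY = 2.0                 # wait ~2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay (0.5x-1.5x) between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
CONCURRENT_REQUESTS_PER_IP = 1       # if non-zero, this per-IP limit is used instead

# Optionally let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```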

What and how should I log?

The information that Scrapy puts on the console is quite extensive, but you may want to log any errors and exceptions raised during the crawl. I personally like the idea of listening for the spider_error signal, which is fired when a spider callback raises an exception.
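
As a minimal sketch, you could connect to that signal from an extension and write the failures to a dedicated logger; the module path and class name here are placeholders:

```python
# extensions.py -- log spider_error signals to a dedicated logger
# (module and class names are placeholders)
import logging

from scrapy import signals


class SpiderErrorLogger:
    def __init__(self):
        self.logger = logging.getLogger("spider_errors")

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # spider_error fires whenever a spider callback raises an exception
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        self.logger.error("Callback failed for %s: %s", response.url, failure.getTraceback())
```

Enable it through the EXTENSIONS setting, e.g. EXTENSIONS = {"myproject.extensions.SpiderErrorLogger": 500} (the path is again a placeholder).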

What other points of attention should I consider before launching?

You still have a lot to think about.

At some point, you may get banned. There is always a reason for this; the most obvious one is that you are still crawling too hard and they don't like it. There are certain techniques and tricks to avoid being banned, such as rotating IP addresses, using proxies, or running the scraper in the cloud.
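
One common building block is a downloader middleware that sets the proxy meta key on each request, which Scrapy's built-in HttpProxyMiddleware then honours. A minimal sketch, with placeholder proxy URLs:

```python
# middlewares.py -- rotate requests across a pool of proxies
# (the proxy URLs are placeholders; use a real, working proxy list)
import random


class RotatingProxyMiddleware:
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks up the 'proxy' meta key
        request.meta["proxy"] = random.choice(self.PROXIES)
```

You would then enable it via DOWNLOADER_MIDDLEWARES, e.g. DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 350}; the module path and priority are illustrative.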

Another thing to worry about is crawling speed and scaling; at some point you may need to distribute your crawling process. This is where scrapyd can help:
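
With a scrapyd server running (by default on port 6800) and your project deployed to it, you can schedule runs over its HTTP JSON API; the host, project and spider names below are placeholders for your own deployment:

```python
# schedule_crawl.py -- start a spider on a scrapyd server via its HTTP API
# (host, project and spider names are placeholders)
import requests

response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "realestate", "spider": "listings"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```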

However, make sure you don't cross the line and stay on the legal side.
