Scraping large amounts of heterogeneous data into structured datasets

I understand the concept of web scraping. The framework I'm using for this is Python / Scrapy, though I'm sure there are many others. My question is more about the basics. Let's say I need to scrape news content. So I crawl the page and then write selectors to retrieve the content, images, author, published date, description, comments, etc. Writing this code is not the problem.
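To make it concrete, here is a minimal sketch of the kind of per-site spider I mean. The site URL and CSS selectors are hypothetical and would be different for every single source.

```python
# A minimal sketch of a per-site Scrapy spider; the URL and selectors
# below are hypothetical and tied to one site's HTML layout.
import scrapy


class ExampleNewsSpider(scrapy.Spider):
    name = "example_news"
    start_urls = ["https://www.example-news-site.com/latest"]

    def parse(self, response):
        # Follow links to individual article pages (selector is hypothetical).
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Every selector here is specific to this one site's markup.
        yield {
            "title": response.css("h1.headline::text").get(),
            "author": response.css("span.byline::text").get(),
            "published": response.css("time::attr(datetime)").get(),
            "body": " ".join(response.css("div.article-body p::text").getall()),
            "image": response.css("meta[property='og:image']::attr(content)").get(),
        }
```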

The question is: how can I make this scalable to a large number of data sources? For example, there might be thousands of news sites, each with its own HTML / page structure, so inevitably I need to write scraping logic for EACH ONE OF THEM. While possible, this would require a lot of people working for a long time to build and maintain these crawlers / scrapers.

Is there an easy way to do this? Can I somehow make it easier to create a different scraper for each data source (website)?
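For instance, one pattern I have been considering is keeping the per-site differences in a small selector table and reusing a single generic spider, so that adding a new source only means adding a new configuration entry. The site names and selectors below are hypothetical; I'm not sure whether this scales to thousands of sources.

```python
# Sketch of a config-driven spider: per-site selectors live in data,
# not code. Site names and selectors are made up for illustration.
import scrapy

SITE_CONFIGS = {
    "example-news-site.com": {
        "title": "h1.headline::text",
        "author": "span.byline::text",
        "body": "div.article-body p::text",
    },
    "another-news-site.com": {
        "title": "h1.post-title::text",
        "author": "a.author-name::text",
        "body": "section.content p::text",
    },
}


class GenericNewsSpider(scrapy.Spider):
    name = "generic_news"
    # Hypothetical entry points, one per configured source.
    start_urls = [f"https://www.{domain}/latest" for domain in SITE_CONFIGS]

    def parse(self, response):
        # Pick the selector set for whichever site this page came from.
        domain = response.url.split("/")[2].removeprefix("www.")
        config = SITE_CONFIGS.get(domain)
        if config is None:
            return  # No scraping rules defined for this source yet.
        yield {
            field: " ".join(response.css(selector).getall()).strip()
            for field, selector in config.items()
        }
```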

How do sites such as Recorded Future do it? Do they have a large team working around the clock, given that they claim to pull data from 250,000+ DISTINCT sources?

Looking forward to some enlightening responses.

Thanks, Abi
