Scraping large amounts of heterogeneous data into structured datasets
I am evaluating the science of web scraping. The framework I'm using for this is Python / Scrapy, though I'm sure there are many others. My question is more about the basics. Let's say I need to scrape news content. So I crawl the page and then write selectors to extract the content, images, author, published date, description, comments, etc. Writing this code is not the hard part.
The question is how I can optimize this so that it scales to a large number of data sources. For example, there might be thousands of news sites, each with its own HTML / page structure, so inevitably I need to write scraping logic for EACH ONE OF THEM. While possible, it would require a large team working for a long time to build and maintain these crawlers / scrapers.
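To make the per-site cost concrete, here is a minimal sketch of what I mean. The two pages and all selectors are made up for illustration, and it uses the standard library's `xml.etree.ElementTree` (which only handles well-formed markup and a limited XPath subset) instead of Scrapy, just so the snippet is self-contained. In Scrapy the same lookups would go through `response.xpath(...)` or `response.css(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup from two made-up news sites (not real pages).
SITE_A = """<html><body>
<h1 class="headline">Storm hits coast</h1>
<span class="byline">Jane Doe</span>
<div class="story"><p>Heavy rain fell overnight.</p></div>
</body></html>"""

SITE_B = """<html><body>
<div id="article">
<h2>Markets rally</h2>
<p class="author">John Roe</p>
<p class="text">Stocks rose on Tuesday.</p>
</div>
</body></html>"""

# The same three fields need a different selector on every site.
# With thousands of sites, this table is what has to be written and maintained.
SELECTORS = {
    "site_a": {
        "title": ".//h1[@class='headline']",
        "author": ".//span[@class='byline']",
        "content": ".//div[@class='story']/p",
    },
    "site_b": {
        "title": ".//div[@id='article']/h2",
        "author": ".//p[@class='author']",
        "content": ".//p[@class='text']",
    },
}


def extract(site, html):
    """Return a dict of fields for one page, using that site's selectors."""
    root = ET.fromstring(html)
    record = {}
    for field, path in SELECTORS[site].items():
        node = root.find(path)  # first matching element, or None
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


if __name__ == "__main__":
    print(extract("site_a", SITE_A))
    print(extract("site_b", SITE_B))
```

The extraction function is generic, but every new source still needs its own hand-written entry in `SELECTORS`, and each entry breaks whenever that site changes its layout.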
Is there an easier way to do this? Can I somehow simplify the process of building a separate scraper for each data source (website)?
How do sites such as recorded do it? Do they also have a large team working around the clock, given that they claim to pull data from 250,000+ DISTINCT sources?
Looking forward to some enlightening responses.
Thanks, Abi