Scraping large amounts of heterogeneous data into structured datasets

I appreciate the science of web scraping. The framework I'm using for this is Python/Scrapy, though I'm sure there are many others. My question is more about the basics. Let's say I need to scrape news content: I crawl the page and then write selectors to extract the content, images, author, published date, description, comments, etc. Writing this code is not really the concern.
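For illustration, this is roughly the kind of per-site spider I end up writing; the URL and CSS selectors below are made up for the example and would be different on every site:

    import scrapy

    class ExampleNewsSpider(scrapy.Spider):
        # One spider like this per news site; everything below is site-specific
        name = "example_news"
        start_urls = ["https://news.example.com/latest"]  # invented URL

        def parse(self, response):
            # Follow each article link found on the listing page
            for href in response.css("a.article-link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # These selectors only match this one site's HTML structure
            yield {
                "title": response.css("h1.headline::text").get(),
                "author": response.css("span.byline::text").get(),
                "published": response.css("time::attr(datetime)").get(),
                "description": response.css("meta[name=description]::attr(content)").get(),
                "body": " ".join(response.css("div.article-body p::text").getall()),
                "images": response.css("div.article-body img::attr(src)").getall(),
            }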

The question is: how can I make this scale to a large number of data sources? For example, there might be thousands of news sites, each with its own HTML/page structure, so inevitably I need to write scraping logic for EACH ONE OF THEM. While possible, that would require a large pool of people working for a long time to build and maintain these crawlers/scrapers.

Is there an easier way to do this? Can I somehow make it less work to create a scraper for each data source (website)? A sketch of the only idea I have so far follows below.
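To make the question concrete, the one idea I've had is to move the per-site differences into configuration so that a single generic spider can serve many sites. A rough sketch of what I mean (site names, URLs, and selectors are all invented):

    import scrapy

    # Hypothetical per-site rules; in practice these could live in a JSON file or
    # database, so adding a site means adding a config entry rather than new code.
    SITE_CONFIGS = {
        "site_a": {
            "start_urls": ["https://site-a.example/news"],
            "link_css": "a.story::attr(href)",
            "fields": {
                "title": "h1::text",
                "author": ".byline::text",
                "published": "time::attr(datetime)",
            },
        },
        "site_b": {
            "start_urls": ["https://site-b.example/latest"],
            "link_css": "a.article-card::attr(href)",
            "fields": {
                "title": "h1.post-title::text",
                "author": "span.author-name::text",
                "published": "meta[property='article:published_time']::attr(content)",
            },
        },
    }

    class GenericNewsSpider(scrapy.Spider):
        name = "generic_news"

        def __init__(self, site=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.config = SITE_CONFIGS[site]            # pick this run's site rules
            self.start_urls = self.config["start_urls"]

        def parse(self, response):
            # Listing page: follow every article link the config points at
            for href in response.css(self.config["link_css"]).getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # Article page: apply the configured selectors field by field
            item = {"source": response.url}
            for field, selector in self.config["fields"].items():
                item[field] = response.css(selector).get()
            yield item

Something like this could be run per site, e.g. scrapy crawl generic_news -a site=site_a. But even then someone still has to discover and maintain the selectors for every single site, which is the part I don't see how to scale.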

How do sites such as Recorded Future do it? Do they have a large team working around the clock, given that they claim to pull data from 250,000+ DISTINCT sources?

Looking forward to some enlightening responses.

Thanks, Abi

web-crawler web-scraping scrapy screen-scraping scraper




No one has answered this question yet
