Scraping large amounts of heterogeneous data into structured datasets

I appreciate the science of web scraping. The framework I'm using for this is Python/Scrapy, though I'm sure there are many others. My question is more about the basics. Let's say I need to scrape news content: I crawl the page and then write selectors to extract the content, images, author, published date, description, comments, etc. Writing this code is not really the concern.
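For illustration, this is roughly the kind of per-site spider I end up writing; the URL and CSS selectors below are made up for the example and would be different on every site:

    import scrapy

    class ExampleNewsSpider(scrapy.Spider):
        # One spider like this per news site; everything below is site-specific
        name = "example_news"
        start_urls = ["https://news.example.com/latest"]  # invented URL

        def parse(self, response):
            # Follow each article link found on the listing page
            for href in response.css("a.article-link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # These selectors only match this one site's HTML structure
            yield {
                "title": response.css("h1.headline::text").get(),
                "author": response.css("span.byline::text").get(),
                "published": response.css("time::attr(datetime)").get(),
                "description": response.css("meta[name=description]::attr(content)").get(),
                "body": " ".join(response.css("div.article-body p::text").getall()),
                "images": response.css("div.article-body img::attr(src)").getall(),
            }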

The question is: how can I make this scale to a large number of data sources? For example, there might be thousands of news sites, each with its own HTML/page structure, so inevitably I need to write scraping logic for EACH ONE OF THEM. While possible, that would require a large pool of people working for a long time to build and maintain these crawlers/scrapers.

Is there an easier way to do this? Can I somehow make it less work to create a scraper for each data source (website)? A sketch of the only idea I have so far follows below.
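To make the question concrete, the one idea I've had is to move the per-site differences into configuration so that a single generic spider can serve many sites. A rough sketch of what I mean (site names, URLs, and selectors are all invented):

    import scrapy

    # Hypothetical per-site rules; in practice these could live in a JSON file or
    # database, so adding a site means adding a config entry rather than new code.
    SITE_CONFIGS = {
        "site_a": {
            "start_urls": ["https://site-a.example/news"],
            "link_css": "a.story::attr(href)",
            "fields": {
                "title": "h1::text",
                "author": ".byline::text",
                "published": "time::attr(datetime)",
            },
        },
        "site_b": {
            "start_urls": ["https://site-b.example/latest"],
            "link_css": "a.article-card::attr(href)",
            "fields": {
                "title": "h1.post-title::text",
                "author": "span.author-name::text",
                "published": "meta[property='article:published_time']::attr(content)",
            },
        },
    }

    class GenericNewsSpider(scrapy.Spider):
        name = "generic_news"

        def __init__(self, site=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.config = SITE_CONFIGS[site]            # pick this run's site rules
            self.start_urls = self.config["start_urls"]

        def parse(self, response):
            # Listing page: follow every article link the config points at
            for href in response.css(self.config["link_css"]).getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # Article page: apply the configured selectors field by field
            item = {"source": response.url}
            for field, selector in self.config["fields"].items():
                item[field] = response.css(selector).get()
            yield item

Something like this could be run per site, e.g. scrapy crawl generic_news -a site=site_a. But even then someone still has to discover and maintain the selectors for every single site, which is the part I don't see how to scale.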

How do sites such as Recorded Future do it? Do they have a large team working around the clock, given that they claim to pull data from 250,000+ DISTINCT sources?

Looking forward to some enlightening responses.

Thanks, Abi

web-crawler web-scraping scrapy screen-scraping scraper




No one has answered this question yet
