Do scrapers need to be written separately for each site they target?

I'm new to scraping. I wrote a scraper that scrapes the Maplin store, using Python as the language and BeautifulSoup to parse the pages.

I want to ask: if I need to scrape some other ecommerce store (say Amazon or Flipkart), do I have to customize my code? They have a different HTML schema (different id and class names, plus other things), so the scraper I wrote won't work for another ecommerce store.

I want to know how price comparison sites scrape data from all online stores. Do they have different code for each online store, or one common codebase? Do they study the HTML schema of every online store?



1 answer


I need to customize my code

Yes, of course. This is not only because websites have different HTML schemas. It also comes down to the mechanisms involved in page loading and rendering: some sites use AJAX to load parts of the page content, others let JavaScript fill in placeholders on the page, which makes scraping harder - there can be many, many differences. Others will use anti-scraping techniques: checking your headers and your behavior, banning you if you hit the site too often, etc.
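For instance, a scraper often has to send browser-like headers and slow itself down before a site will serve it anything useful. A minimal sketch using the requests library (the URL, header values and delay are placeholders, not tuned for any particular site):

import time
import requests

# Placeholder header values - a real scraper would tune these per site.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch(url, delay_seconds=2.0):
    """Fetch a page with browser-like headers and a polite delay between requests."""
    time.sleep(delay_seconds)                      # avoid hitting the site too often
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text

html = fetch("https://www.example-store.com/some-product")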

I've also seen cases where prices were stored as images, or obfuscated with "noise" - different tags nested inside each other and hidden using various methods such as CSS rules, classes, JS code, display: none, etc. To the end user in a browser the data looked fine, but to a web-scraping robot it was a mess.

I want to know how price comparison sites scrape data from all online stores?

They usually use APIs whenever possible. But if an API is not available, web scraping and HTML parsing is always an option.
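As a rough illustration of the two paths (the API endpoint and the CSS selector below are hypothetical - every real store differs), getting a price from an API is a single JSON request, while the fallback is the kind of HTML parsing described in the question:

import requests
from bs4 import BeautifulSoup

# Hypothetical API endpoint - the URL and response shape are made up for illustration.
api_response = requests.get("https://api.example-store.com/v1/products/12345")
price_from_api = api_response.json().get("price")

# Fallback: fetch the product page and parse the HTML instead.
page = requests.get("https://www.example-store.com/products/12345")
soup = BeautifulSoup(page.text, "html.parser")
price_tag = soup.select_one("span.price")          # the selector is site-specific
price_from_html = price_tag.get_text(strip=True) if price_tag else None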


The general high-level idea is to split the scraping code into two main parts. The static part is a generic web spider (the logic) that reads whatever parameters or configuration it is passed. The dynamic part - an annotator or website-specific configuration - is usually a set of XPath or CSS selector expressions, one per field.
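A minimal sketch of that split, using the question's own tools (requests and BeautifulSoup); the store names and CSS selectors are made up, and a real setup would load the configuration from a file or database rather than a hard-coded dictionary:

import requests
from bs4 import BeautifulSoup

# Dynamic part: per-site configuration, nothing but field names and CSS selectors.
SITE_CONFIGS = {
    "example-store": {
        "price": "div.price",
        "description": "div.title span.desc",
    },
    "another-store": {
        "price": "span#product-price",
        "description": "h1.product-name",
    },
}

# Static part: a generic scraper that only knows how to apply a config to a page.
def scrape(url, config):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    result = {}
    for field, selector in config.items():
        node = soup.select_one(selector)
        result[field] = node.get_text(strip=True) if node else None
    return result

data = scrape("https://www.example-store.com/item/1", SITE_CONFIGS["example-store"])

Supporting a new store then means adding one more configuration entry, not writing a new scraper.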

See, for example, the Autoscraping tool provided by Scrapinghub:

Autoscraping is a tool to scrape websites without any programming knowledge. You just annotate web pages visually (with a point-and-click tool) to indicate where each field is on the page, and Autoscraping will scrape any similar page from the site.



And, FYI, explore what the Scrapinghub docs offer as well - there is a lot of useful information and a set of different unique web scraping tools.


I was personally involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (an annotator), and the field annotations were stored as JSON:

{
    "price": "//div[@class='price']/text()",  
    "description": "//div[@class='title']/span[2]/text()"
}

The generic spider got the target id as a parameter, read the config and crawled the website.
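Roughly, such a generic spider might look like the sketch below. This is not the original project code: the database table is simplified into a dictionary and the URL is a placeholder; only the XPath expressions mirror the JSON example above:

import scrapy

# Stand-in for the "target" database table filled in by the annotator.
TARGETS = {
    "1": {
        "start_url": "https://www.example-store.com/catalogue/",
        "fields": {
            "price": "//div[@class='price']/text()",
            "description": "//div[@class='title']/span[2]/text()",
        },
    },
}

class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, target_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.config = TARGETS[target_id]           # read the per-target config
        self.start_urls = [self.config["start_url"]]

    def parse(self, response):
        # Apply every annotated XPath expression to the crawled page.
        item = {}
        for field, xpath in self.config["fields"].items():
            item[field] = response.xpath(xpath).get()
        yield item

# Run with:  scrapy runspider generic_spider.py -a target_id=1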

We had a lot of problems keeping the generic part generic. As soon as JavaScript and AJAX were involved on a website, we started writing site-specific logic to get the data we wanted.

