What are the best out-of-the-box libraries for doing web crawling in Python?

I need to crawl and store locally, for future analysis, the contents of a finite list of websites. I basically want to slurp all the pages and follow all internal links, so that I end up with each entire publicly available site.

Are there any free libraries for this? I've seen Chilkat, but it's a paid product. I'm just looking for basic functionality here. Thoughts? Suggestions?


Exact duplicate: Does anyone know of a good Python-based web crawler that I could use?



2 answers


Use Scrapy.

It is a Twisted-based web crawling framework. It is still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, and more (see the settings sketch after this list)
  • An interactive scraping shell console, very useful for development and debugging
  • A web management console for monitoring and controlling your bot
  • A Telnet console for low-level access to the Scrapy process
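
Most of those built-in features are toggled through a project's settings module. A minimal sketch of what that looks like in current Scrapy releases (the values below are illustrative, not recommendations, and "mycrawler" is a made-up project name):

# settings.py -- illustrative values only
BOT_NAME = "mycrawler"                 # hypothetical project name
USER_AGENT = "mycrawler (+http://www.example.com)"  # user-agent string sent with requests
ROBOTSTXT_OBEY = True                  # respect robots.txt
DEPTH_LIMIT = 5                        # restrict crawl depth
COOKIES_ENABLED = True                 # handle cookies automatically
HTTPCACHE_ENABLED = True               # cache responses on disk between runs
DOWNLOAD_DELAY = 0.5                   # throttle requests to the target site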


Example code to extract information about all the torrent files added today on the mininova torrent site, using XPath selectors on the returned HTML:

class Torrent(ScrapedItem):
    # Item that will hold the scraped fields; attributes are assigned in parse_torrent.
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    # Follow every link matching /tor/<id> and hand the response to parse_torrent.
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        # Pull the interesting fields out of the page with XPath and return the item.
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
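
That snippet uses an early Scrapy API (ScrapedItem, HtmlXPathSelector, RegexLinkExtractor, with the imports omitted). A rough equivalent with the current Scrapy API would look like the sketch below; the XPath expressions are copied from the answer above and may no longer match the live site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow every /tor/<id> link and pass the response to parse_torrent.
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        # response.xpath() replaces the old HtmlXPathSelector.
        yield {
            'url': response.url,
            'name': response.xpath("//h1/text()").get(),
            'description': response.xpath("//div[@id='description']").get(),
            'size': response.xpath("//div[@id='info-left']/p[2]/text()[2]").get(),
        }

Saved as mininova_spider.py, it can be run without a full project via scrapy runspider mininova_spider.py -o torrents.json, which writes the scraped items to a local JSON file.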

      



Do you really need a library? I highly recommend Heritrix as a great all-purpose crawler that will save the entire web page (as opposed to more common crawlers, which only store part of the text). It's a little rough around the edges, but works great.
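
If a full framework or an external crawler is more than you need, the crawl-everything-and-store-it-locally behaviour the question describes can be sketched with just the Python standard library. This is purely illustrative (the start URL is a placeholder, and there is no robots.txt handling, rate limiting, or retry logic):

import os
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Collects the href value of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, out_dir="pages"):
    os.makedirs(out_dir, exist_ok=True)
    host = urllib.parse.urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except Exception:
            continue
        # Store the raw page under a filename derived from its URL.
        fname = urllib.parse.quote(url, safe="") + ".html"
        with open(os.path.join(out_dir, fname), "wb") as f:
            f.write(html)
        # Queue every internal link for crawling.
        parser = LinkParser()
        parser.feed(html.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if urllib.parse.urlparse(absolute).netloc == host:
                queue.append(absolute.split("#")[0])

if __name__ == "__main__":
    crawl("http://example.com/")   # placeholder start URL

Robots.txt handling, throttling, deduplication, and error recovery are exactly the chores a framework like Scrapy or a dedicated crawler such as Heritrix takes care of for you.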



You could also try HarvestMan: http://www.harvestmanontheweb.com/
