What are the best out-of-the-box libraries for doing web crawling in Python?

I need to crawl and store locally, for future analysis, the contents of a finite list of websites. I basically want to slurp all the pages and follow all internal links, so that I end up with each entire publicly available site.

Are there any free libraries for this? I've seen Chilkat, but it's a paid product. I'm just looking for basic functionality here. Thoughts? Suggestions?


Exact duplicate: Does anyone know of a good Python-based web crawler that I could use?



2 answers


Use Scrapy.

It is a Twisted-based web crawling framework. It is still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, and more (see the settings sketch after this list)
  • An interactive scraping shell console, very useful for development and debugging
  • A web management console for monitoring and controlling your bot
  • A Telnet console for low-level access to the Scrapy process
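
Most of those built-in features are toggled through a project's settings module. A minimal sketch of what that looks like in current Scrapy releases (the values below are illustrative, not recommendations, and "mycrawler" is a made-up project name):

# settings.py -- illustrative values only
BOT_NAME = "mycrawler"                 # hypothetical project name
USER_AGENT = "mycrawler (+http://www.example.com)"  # user-agent string sent with requests
ROBOTSTXT_OBEY = True                  # respect robots.txt
DEPTH_LIMIT = 5                        # restrict crawl depth
COOKIES_ENABLED = True                 # handle cookies automatically
HTTPCACHE_ENABLED = True               # cache responses on disk between runs
DOWNLOAD_DELAY = 0.5                   # throttle requests to the target site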


Example code to extract information about all the torrent files added today on the mininova torrent site, using XPath selectors on the returned HTML:

class Torrent(ScrapedItem):
    # Item that will hold the scraped fields; attributes are assigned in parse_torrent.
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    # Follow every link matching /tor/<id> and hand the response to parse_torrent.
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        # Pull the interesting fields out of the page with XPath and return the item.
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
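
That snippet uses an early Scrapy API (ScrapedItem, HtmlXPathSelector, RegexLinkExtractor, with the imports omitted). A rough equivalent with the current Scrapy API would look like the sketch below; the XPath expressions are copied from the answer above and may no longer match the live site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow every /tor/<id> link and pass the response to parse_torrent.
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        # response.xpath() replaces the old HtmlXPathSelector.
        yield {
            'url': response.url,
            'name': response.xpath("//h1/text()").get(),
            'description': response.xpath("//div[@id='description']").get(),
            'size': response.xpath("//div[@id='info-left']/p[2]/text()[2]").get(),
        }

Saved as mininova_spider.py, it can be run without a full project via scrapy runspider mininova_spider.py -o torrents.json, which writes the scraped items to a local JSON file.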

      



Do you really need a library? I highly recommend Heritrix as a great all-purpose crawler that will save the entire web page (as opposed to more common crawlers, which only store part of the text). It's a little rough around the edges, but works great.
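
If a full framework or an external crawler is more than you need, the crawl-everything-and-store-it-locally behaviour the question describes can be sketched with just the Python standard library. This is purely illustrative (the start URL is a placeholder, and there is no robots.txt handling, rate limiting, or retry logic):

import os
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Collects the href value of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, out_dir="pages"):
    os.makedirs(out_dir, exist_ok=True)
    host = urllib.parse.urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except Exception:
            continue
        # Store the raw page under a filename derived from its URL.
        fname = urllib.parse.quote(url, safe="") + ".html"
        with open(os.path.join(out_dir, fname), "wb") as f:
            f.write(html)
        # Queue every internal link for crawling.
        parser = LinkParser()
        parser.feed(html.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if urllib.parse.urlparse(absolute).netloc == host:
                queue.append(absolute.split("#")[0])

if __name__ == "__main__":
    crawl("http://example.com/")   # placeholder start URL

Robots.txt handling, throttling, deduplication, and error recovery are exactly the chores a framework like Scrapy or a dedicated crawler such as Heritrix takes care of for you.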



You could also try HarvestMan: http://www.harvestmanontheweb.com/
