Extensible / customizable web crawling engines / frameworks / libraries?

I have a relatively simple case. I basically want to store link data between different websites, and I don't want to restrict the domains. I know I could write my own crawler using some HTTP client library, but I feel I would be doing a lot of unnecessary work: making sure pages are not fetched more than once, working out how to read and honor the robots.txt file, maybe even trying to make it parallel and distributed, and I'm sure there is a lot more that I haven't thought of yet.

So I need a web crawling framework that takes care of these things while letting me dictate what to do with the responses (in my case, just extract the links and store them). Most crawlers seem to assume that you are indexing web pages for search, and that is no good; I need something customizable.
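To make the requirement concrete, here is a rough sketch of the bookkeeping I would rather not maintain myself: link extraction, a seen-set so no page is fetched twice, and a robots.txt check. It uses only the Python standard library. The names `fetch`, `robots_txt_for`, `crawl`, and the `example-bot` user agent are my own placeholders, not part of any framework, and a real crawler would also need politeness delays, error handling, and robots.txt caching.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def allowed(robots_txt, user_agent, url):
    """Checks a URL against the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


def crawl(start_url, fetch, robots_txt_for, user_agent="example-bot", max_pages=100):
    """Breadth-first crawl that returns (source, target) link pairs.

    `fetch(url)` and `robots_txt_for(url)` are hypothetical callables
    supplied by the caller; both return text.
    """
    seen = {start_url}          # URLs already queued, so none is fetched twice
    queue = deque([start_url])
    edges = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        if not allowed(robots_txt_for(url), user_agent, url):
            continue
        html = fetch(url)
        fetched += 1
        for target in extract_links(url, html):
            edges.append((url, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return edges
```

Passing the two callables in keeps the sketch independent of any particular HTTP client, which is roughly the kind of hook I would hope a framework exposes.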

I want to store the link information in a MongoDB database, so I need to be able to define how links are stored from within the framework. And while I posed the question as language-agnostic, this also means I have to restrict my choice of framework to one written in a MongoDB-supported language (Python, Ruby, Perl, PHP, Java, and C++), which is still a very wide net. I prefer dynamic languages, but I am open to any suggestions.
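For illustration, this is roughly the storage hook I have in mind, sketched in Python. The document layout (`source`, `target`, `source_host`, `target_host`, `seen_count`) and the function name `store_link` are assumptions of mine, not an established schema. The `collection` argument is expected to behave like a PyMongo collection, of which only `update_one(filter, update, upsert=True)` is used.

```python
from urllib.parse import urlsplit


def store_link(collection, source, target):
    """Upserts one link so each (source, target) pair is stored only once.

    `collection` is assumed to expose a PyMongo-style update_one method.
    """
    collection.update_one(
        {"source": source, "target": target},
        {
            # Written only when the pair is first inserted.
            "$setOnInsert": {
                "source_host": urlsplit(source).hostname,
                "target_host": urlsplit(target).hostname,
            },
            # Counts how many times the same link has been encountered.
            "$inc": {"seen_count": 1},
        },
        upsert=True,
    )
```

With PyMongo this could be called as `store_link(client["crawl"]["links"], src, dst)`, where the database and collection names are again placeholders.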

I was able to find Scrapy (which looks neat) and JSpider (which seems good, but maybe too "heavy", judging by the 121-page user manual), but I wanted to see if there are other good options that I'm missing.

+2




3 answers


I assume you have already searched Stack Overflow yourself, as there are quite a few similar questions under the web-crawler tag. Not having used any of the following myself, I will refrain from elaborating and just list a few that seem worth considering for this task:



Good luck with your review ;)

+6




You can also try CasperJS with PhantomJS in Node.js.



0




StormCrawler was not around when this question was asked, but it would fit the bill quite well. It is in Java, is highly modular and scalable, and can be configured to do what was described above.

0

