Hadoop on EC2 vs. ElasticMapReduce / S3

I've been using ElasticMapReduce for a while now. It's pretty handy, but I can't get HBase up and running on it since the Hadoop cluster is only temporarily available (I've asked a few related questions about HBase and Hadoop).

So, I want to try installing Hadoop on a set of EC2 machines. I know Hadoop has an EC2-related directory - src/contrib/ec2. It looks like a Hadoop cluster can be launched just by typing a command, and I can then log in to the master node to start jobs and so on. Before trying this, I would like to hear about any pitfalls from people who have used it. Thanks!



1 answer

Indeed, there are two ways to run Hadoop on Amazon - provisioning your own cluster or using EMR. Orthogonally to that decision, you can use HDFS or S3 as the file system. This is not a short story, but I will try to lay out some pros and cons of each of these options.
EMR is a good fit if you need to run one or several jobs per day and don't need an always-on cluster. In this case, you put your data in S3 and can script the whole process. You also save the time it takes to install a cluster. The main disadvantage is that tuning the configuration, using third-party libraries, etc. is harder. If you want to customize Hadoop, you must set up your own cluster.
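To make the "few jobs per day vs. always-on" trade-off concrete, here is a rough sketch of the arithmetic. The function name and all numbers are made up for illustration, and it ignores billing granularity, EMR surcharges and S3 costs, so treat it as a back-of-the-envelope estimate only.

```python
def cluster_hours(nodes, days=30, always_on=True, jobs_per_day=0, hours_per_job=0.0):
    """Instance-hours used in `days` days by a cluster of `nodes` machines.

    Hypothetical helper: an always-on cluster runs 24 hours a day, while a
    transient (EMR-style) cluster only runs while jobs are executing.
    """
    if always_on:
        return nodes * 24 * days
    return nodes * jobs_per_day * hours_per_job * days


# Assumed workload: 10 nodes, two 2-hour jobs per day, over 30 days.
persistent = cluster_hours(nodes=10)
transient = cluster_hours(nodes=10, always_on=False, jobs_per_day=2, hours_per_job=2)

print("always-on instance-hours:", persistent)
print("transient instance-hours:", transient)
```

Multiply either figure by your actual hourly instance price to compare costs; the gap shrinks as the jobs fill more of the day.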
When your data is already in S3, or you need to keep it after processing, S3 is a good choice. The trade-off is slightly lower performance than with HDFS. It should also be pointed out that Amazon instances have very little local storage, so keeping a lot of data in HDFS gets very expensive, and you have to keep the cluster running (and pay for it) to keep that storage.
I would say that if you really need HDFS with all its throughput, you really need your own cluster on your own hardware. When you're working on Amazon, it's most practical to use S3 as your file system.


