The best way to deploy Spark?

Are there significant benefits to deploying Spark on YARN or EMR instead of plain EC2? This would primarily be for research and prototyping, and probably using Scala. Our reluctance to use anything other than EC2 stems mainly from the additional infrastructure and complexity the other options may involve, but perhaps they also provide significant benefits?

We will mainly be reading and writing data from/to S3.

+3




3 answers


Let's separate the different layers. There is the infrastructure layer, i.e. where the (virtual) machines that run your Spark jobs actually live. Options include a cluster of local machines or a cluster of virtual machines rented from EC2. Especially when reading and writing large amounts of data from/to S3, EC2 can be a good option, as both services are well integrated and usually run in the same datacenters (giving you better network performance).
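For the S3-heavy use case mentioned in the question, here is a minimal Scala sketch of a job that reads from and writes back to S3 (bucket names and paths are placeholders, and AWS credentials are assumed to come from the environment or an instance role):

    import org.apache.spark.{SparkConf, SparkContext}

    object S3WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3-wordcount"))

        // Read from and write back to S3; "my-bucket" is a placeholder.
        val lines = sc.textFile("s3n://my-bucket/input/*.txt")
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("s3n://my-bucket/output/wordcount")

        sc.stop()
      }
    }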

The second layer is the software/scheduling layer on top, i.e. which piece of software ties all these machines together and schedules Spark jobs on them. Options here include YARN (the scheduler from the Hadoop project), Mesos (a general-purpose scheduler that can also handle non-Hadoop workloads), and Myriad (essentially YARN on Mesos).

A good comparison between YARN and Mesos can be found here.
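To make the layering concrete: from the application's point of view, the scheduler choice mostly shows up in the master URL handed to Spark (usually supplied via spark-submit rather than hard-coded). A rough sketch, with placeholder hostnames:

    import org.apache.spark.{SparkConf, SparkContext}

    object SchedulerDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("scheduler-demo")
          .setMaster("yarn-client")            // YARN, client mode (Spark 1.x syntax)
          // .setMaster("mesos://host:5050")   // Mesos master (placeholder host)
          // .setMaster("spark://host:7077")   // standalone master, e.g. a spark-ec2 cluster

        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).count()) // trivial job to check the cluster is reachable
        sc.stop()
      }
    }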



EMR gives you an easy way to deploy a Hadoop/YARN cluster. There are even bootstrap actions for installing Spark on such a cluster.

Hope this helped answer your question!

+4




EMR is "the same" as EC2 but has Hadoop installed on them. If you don't need Hive / Pig or Hadoop then I think you will pay the extra EMR cost for nothing. Takeaway: If you only need to use Spark, it's better to use EC2, you can get a "few clicks" SPARK cluster. You only need to use: spark-ec2 script to get it:



Another thing: when you say YARN, I think you may be mixing up the concepts of EC2, EMR, and YARN. Let me explain: YARN (Yet Another Resource Negotiator) is one of the cluster managers Spark can use to run on a large cluster of machines. You can run Spark on Mesos (https://spark.apache.org/docs/latest/running-on-mesos.html) or Spark on YARN (https://spark.apache.org/docs/1.3.1/running-on-yarn.html). See also: http://radar.oreilly.com/2015/02/a-tale-of-two-clusters-mesos-and-yarn.html

+1




We run Spark on a Mesos cluster that we create ourselves on spot instances, which makes it very cheap.

Also, if you are using Spark to access S3, you can use the DirectOutputCommitter, which removes some of the renames that are otherwise necessary when writing output with Hadoop's default committer.
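As a rough illustration (not necessarily this poster's exact setup), the committer can be swapped via the Hadoop configuration carried by the SparkContext; the committer class name below is a placeholder for whichever DirectOutputCommitter implementation is on your classpath, and the bucket is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object DirectS3Write {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("direct-s3-write"))

        // With the default FileOutputCommitter, output is first written to a
        // _temporary directory and then renamed, which is slow on S3. A direct
        // committer writes straight to the final location instead.
        sc.hadoopConfiguration.set(
          "mapred.output.committer.class",            // old MapReduce API, used by saveAsTextFile
          "com.example.hadoop.DirectOutputCommitter") // placeholder class name

        sc.textFile("s3n://my-bucket/input/")         // placeholder bucket
          .saveAsTextFile("s3n://my-bucket/output/")

        sc.stop()
      }
    }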

0








