The best way to deploy Spark?
Are there significant benefits to deploying Spark on YARN or EMR rather than directly on EC2? This would primarily be for research and prototyping, probably using Scala. Our reluctance to use anything other than plain EC2 stems mainly from the additional infrastructure and complexity the other options may bring, but perhaps they also provide significant benefits?
We will mainly read / write data from / to S3.
Let's separate the different layers. First there is the infrastructure layer, that is, where the (virtual) machines that run the Spark job live. Options include an on-premises cluster or a cluster of virtual machines rented from EC2. Especially when reading and writing large amounts of data from / to S3, EC2 can be a good option, as both services are well integrated and usually run in the same datacenters (giving you better network performance).
The second layer is the software / scheduling layer on top, that is, the piece of software that ties all these machines together to schedule and launch your Spark job. Options here include YARN (the scheduler from the Hadoop project), Mesos (a more general-purpose scheduler that can also handle non-Hadoop workloads), and Myriad (essentially YARN on Mesos).
A good comparison between YARN and Mesos can be found here .
EMR gives you the ability to easily deploy a Hadoop / YARN cluster. There are even bootstrap actions to set up Spark on such a cluster.
Hope this helped answer your question!
EMR is essentially "the same" as EC2, but with Hadoop preinstalled on the instances. If you don't need Hive / Pig or Hadoop itself, I think you would be paying the extra EMR cost for nothing. Takeaway: if you only need Spark, it's better to use plain EC2, where you can get a "few clicks" Spark cluster. You only need the spark-ec2 script to get it:
- https://spark.apache.org/docs/latest/ec2-scripts.html
- http://ampcamp.berkeley.edu/exercises-strata-conf-2013/launching-a-cluster.html
One more thing about YARN: I think you are mixing up the concepts of EC2, EMR and YARN. Let me explain: YARN (Yet Another Resource Negotiator) is one of the cluster managers Spark can use to run on a large cluster of machines. You can run Spark on Mesos ( https://spark.apache.org/docs/latest/running-on-mesos.html ) or Spark on YARN ( https://spark.apache.org/docs/1.3.1/running-on-yarn.html ). See here: http://radar.oreilly.com/2015/02/a-tale-of-two-clusters-mesos-and-yarn.html
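To make the distinction concrete: from the application's point of view, the cluster manager mostly shows up as the master URL you hand to Spark, while the job code itself stays the same. Below is a minimal sketch in plain Scala (no Spark dependency) that maps a cluster manager to its master URL. The names `ClusterManager`, `Standalone`, `Mesos`, `Yarn` and `masterUrl` are made up for illustration and are not part of Spark. The ports are the usual defaults, and the host names are placeholders.

```scala
// Hypothetical helper types for illustration only; these are not Spark classes.
sealed trait ClusterManager
// Spark's own standalone manager (this is what spark-ec2 sets up); 7077 is the usual default port.
final case class Standalone(host: String, port: Int = 7077) extends ClusterManager
// A Mesos master; 5050 is the usual default port.
final case class Mesos(host: String, port: Int = 5050) extends ClusterManager
// YARN: the ResourceManager address is read from the Hadoop configuration, not from the URL.
case object Yarn extends ClusterManager

object MasterUrls {
  // Returns the string you would pass to `spark-submit --master` or `SparkConf.setMaster`.
  // Assumption: Spark 2.x or later, where the YARN master is just "yarn";
  // Spark 1.x used "yarn-client" / "yarn-cluster" instead.
  def masterUrl(cm: ClusterManager): String = cm match {
    case Standalone(host, port) => s"spark://$host:$port"
    case Mesos(host, port)      => s"mesos://$host:$port"
    case Yarn                   => "yarn"
  }

  def main(args: Array[String]): Unit = {
    // Placeholder host names; substitute your own masters.
    val managers = Seq(Standalone("spark-master"), Mesos("mesos-master"), Yarn)
    managers.foreach(cm => println(s"$cm -> ${masterUrl(cm)}"))
  }
}
```

In a real job the resulting string would go to `new SparkConf().setMaster(...)`, or more commonly be left out of the code and supplied via `spark-submit --master`, so the same jar can be moved between an EC2 standalone cluster, Mesos, and YARN / EMR.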
We are using Spark on a Mesos cluster that we create in random locations, which makes it very expensive.
Also, if you are using Spark to access S3, you can use DirectOutputCommitter, which skips some of the rename steps that are otherwise necessary when writing through Hadoop's default output committer.