Spark workflow with jar

I'm trying to figure out at what point, and in what form, I need to compile my code into a jar in order to use Spark.

I usually write ad-hoc parsing code in an IDE and run it locally against the data with one click (also in the IDE). If my experimenting with Spark points me in the right direction, it seems I then have to package my script into a jar and ship it to every Spark node. In other words, my workflow would be:

  • Write a parsing script (like the one shown below)
  • Go build a jar from it
  • Run the script

For ad-hoc iterative work this feels rather heavyweight, and I don't understand how the REPL gets away without needing a jar.
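
(For reference, the "build a jar" step would just be an ordinary sbt build. A minimal build.sbt might look roughly like this; the project name, the versions, and the "provided" scoping are illustrative, not something I have settled on:)

name := "analysis"

version := "0.1"

scalaVersion := "2.10.4"

// Spark itself is available on the cluster at runtime,
// so it does not have to be bundled into the jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

Then sbt package produces something like target/scala-2.10/analysis_2.10-0.1.jar, which is the file that gets handed to sc.addJar.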

Update:

Here is an example that I couldn't get to work until I compiled it into a jar and registered it with sc.addJar. The fact that I have to do this seems odd to me, since it is only ordinary Scala and Spark code.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD

object Runner {
  def main(args: Array[String]): Unit = {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")

    val sc = new SparkContext(conf)

    // Ship the compiled analysis code and the data file to the worker nodes.
    sc.addJar("Analysis.jar")
    sc.addFile(logFile)

    // SparkFiles.get resolves the path of a file distributed with addFile.
    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()

    Analysis.run(logData)
  }
}

object Analysis {
  def run(logData: RDD[String]): Unit = {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}

      



3 answers


You are creating an anonymous function in your use of filter:

scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>

      

The class generated for that function is not available unless the jar is distributed to the workers. Did the stack trace on the worker show a missing symbol (a ClassNotFoundException or similar)?



If you just want to debug locally without having to distribute the jar, you can use the "local" master:

val conf = new SparkConf().setAppName("myApp").setMaster("local")
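
With a local master everything runs inside a single JVM, so the example from the question can be launched straight from the IDE with no jar handling at all. Roughly (a sketch; it reuses the Analysis object from the question, and "local[2]" simply means two worker threads):

import org.apache.spark.{SparkConf, SparkContext}

object LocalRunner {
  def main(args: Array[String]): Unit = {
    // Run Spark inside this JVM; nothing has to be shipped to remote workers.
    val conf = new SparkConf().setAppName("MyFirstSpark").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Read the file from the local filesystem; addFile/SparkFiles are not needed either.
    val logData = sc.textFile("myData.txt", 2).cache()
    Analysis.run(logData)

    sc.stop()
  }
}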

      



While building a jar is the most common way to handle long-running Spark jobs, for interactive development Spark has shells available directly in Scala, Python, and R. The current Quick Start Guide (https://spark.apache.org/docs/latest/quick-start.html) only mentions the Scala and Python shells, but the SparkR manual also discusses how to work with SparkR interactively (see https://spark.apache.org/docs/latest/sparkr.html). Good luck with your Spark adventures as you work with larger datasets :)
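
For example, the counts from the question can be reproduced interactively in the Scala shell (bin/spark-shell), which already provides a SparkContext as sc, so nothing needs to be compiled into a jar first (a sketch; the file name is taken from the question):

scala> val logData = sc.textFile("myData.txt").cache()
scala> val numA = logData.filter(line => line.contains("a")).count()
scala> val numB = logData.filter(line => line.contains("b")).count()
scala> println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))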





You can use SparkContext.jarOfObject(Analysis) to automatically pick up the jar that contains the code you want to distribute, without having to hard-code its name.

From the Spark API docs: find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.

def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]

      

You want to do something like:

sc.addJar(SparkContext.jarOfObject(Analysis).get)
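
Note that jarOfObject returns an Option[String], and it will be None when the class was not loaded from a jar (for example when running inside the REPL), in which case .get throws. A slightly more defensive sketch:

SparkContext.jarOfObject(Analysis).foreach(jar => sc.addJar(jar))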

      

HTH!







