Spark workflow with jar

I'm trying to figure out at what point, and in what form, I need to compile my code into a jar in order to use Spark.

I usually write ad-hoc parsing code in an IDE and run it locally against the data with one click (also in the IDE). If my experimenting with Spark points me in the right direction, it seems I then have to package my script into a jar and ship it to every Spark node. In other words, my workflow would be:

  • Write a parsing script (like the one shown below)
  • Go build a jar from it
  • Run the script

For ad-hoc iterative work this feels rather heavyweight, and I don't understand how the REPL gets away without needing a jar.
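
(For reference, the "build a jar" step would just be an ordinary sbt build. A minimal build.sbt might look roughly like this; the project name, the versions, and the "provided" scoping are illustrative, not something I have settled on:)

name := "analysis"

version := "0.1"

scalaVersion := "2.10.4"

// Spark itself is available on the cluster at runtime,
// so it does not have to be bundled into the jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

Then sbt package produces something like target/scala-2.10/analysis_2.10-0.1.jar, which is the file that gets handed to sc.addJar.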

Update:

Here is an example that I couldn't get to work until I compiled it into a jar and registered it with sc.addJar. The fact that I have to do this seems odd to me, since it is only ordinary Scala and Spark code.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD

object Runner {
  def main(args: Array[String]): Unit = {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")

    val sc = new SparkContext(conf)

    // Ship the compiled analysis code and the data file to the worker nodes.
    sc.addJar("Analysis.jar")
    sc.addFile(logFile)

    // SparkFiles.get resolves the path of a file distributed with addFile.
    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()

    Analysis.run(logData)
  }
}

object Analysis {
  def run(logData: RDD[String]): Unit = {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}

      



3 answers


You are creating an anonymous function in your use of filter:

scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>

      

The class generated for that function is not available unless the jar is distributed to the workers. Did the stack trace on the worker show a missing symbol (a ClassNotFoundException or similar)?



If you just want to debug locally without having to distribute the jar, you can use the "local" master:

val conf = new SparkConf().setAppName("myApp").setMaster("local")
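
With a local master everything runs inside a single JVM, so the example from the question can be launched straight from the IDE with no jar handling at all. Roughly (a sketch; it reuses the Analysis object from the question, and "local[2]" simply means two worker threads):

import org.apache.spark.{SparkConf, SparkContext}

object LocalRunner {
  def main(args: Array[String]): Unit = {
    // Run Spark inside this JVM; nothing has to be shipped to remote workers.
    val conf = new SparkConf().setAppName("MyFirstSpark").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Read the file from the local filesystem; addFile/SparkFiles are not needed either.
    val logData = sc.textFile("myData.txt", 2).cache()
    Analysis.run(logData)

    sc.stop()
  }
}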

      



While building a jar is the most common way to handle long-running Spark jobs, for interactive development Spark has shells available directly in Scala, Python, and R. The current Quick Start Guide (https://spark.apache.org/docs/latest/quick-start.html) only mentions the Scala and Python shells, but the SparkR manual also discusses how to work with SparkR interactively (see https://spark.apache.org/docs/latest/sparkr.html). Good luck with your Spark adventures as you work with larger datasets :)
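
For example, the counts from the question can be reproduced interactively in the Scala shell (bin/spark-shell), which already provides a SparkContext as sc, so nothing needs to be compiled into a jar first (a sketch; the file name is taken from the question):

scala> val logData = sc.textFile("myData.txt").cache()
scala> val numA = logData.filter(line => line.contains("a")).count()
scala> val numB = logData.filter(line => line.contains("b")).count()
scala> println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))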





You can use SparkContext.jarOfObject(Analysis) to automatically pick up the jar that contains the code you want to distribute, without having to hard-code its name.

From the Spark API docs: find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.

def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]

      

You want to do something like:

sc.addJar(SparkContext.jarOfObject(Analysis).get)
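
Note that jarOfObject returns an Option[String], and it will be None when the class was not loaded from a jar (for example when running inside the REPL), in which case .get throws. A slightly more defensive sketch:

SparkContext.jarOfObject(Analysis).foreach(jar => sc.addJar(jar))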

      

HTH!







