Assigned variable is not passed to the map function in Spark
I am using Spark 1.3.1 with Scala 2.10.4. I tried a basic scenario: parallelize an array of three strings and map over it, appending a variable that I define in the driver.
Here is the code:
import org.apache.spark.{SparkConf, SparkContext}

object BasicTest extends App {
  val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://xxxxx:7077")
  val sc = new SparkContext(conf)
  val test = sc.parallelize(Array("a", "b", "c"))
  val a = 5
  test.map(row => row + a).saveAsTextFile("output/basictest/")
}
This code works in local mode; the output is:
a5
b5
c5
But on a real cluster I get:
a0
b0
c0
I then tried a different version:
import org.apache.spark.{SparkConf, SparkContext}

object BasicTest extends App {
  def test(sc: SparkContext): Unit = {
    val test = sc.parallelize(Array("a", "b", "c"))
    val a = 5
    test.map(row => row + a).saveAsTextFile("output/basictest/")
  }

  val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://xxxxx:7077")
  val sc = new SparkContext(conf)
  test(sc)
}
This version works both in local mode and on the cluster.
I just need to understand the reasons in each case. Thanks in advance for any advice.
I believe it has to do with the use of App. This was addressed only by SPARK-4170, which adds a warning to spark-submit about using App:
if (classOf[scala.App].isAssignableFrom(mainClass)) {
  printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
}
Notes from the ticket:
This bug appears to be an issue with how Scala defines App and the differences between what happens at runtime versus compile time with respect to the way App leverages the delayedInit function.
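In other words, when the object extends App, a becomes a field whose initialization is deferred until the App body runs, so an executor can presumably still see it at its default value (0 for an Int), which would explain the a0/b0/c0 output. The usual fix, as the warning suggests, is to define an explicit main() method so that a is a plain local variable initialized before the closure is serialized. A minimal sketch of that variant (same cluster URL and output path as in the question, not tested on your setup):

import org.apache.spark.{SparkConf, SparkContext}

object BasicTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://xxxxx:7077")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(Array("a", "b", "c"))
    // a is a local variable of main(), so it is fully initialized
    // when the closure below is serialized and shipped to executors.
    val a = 5
    data.map(row => row + a).saveAsTextFile("output/basictest/")
  }
}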