Type mismatch with identical types in Spark-shell

I have my workflow scripted around a spark-shell, but I am often annoyed by bizarre type mismatches (probably inherited from the Scala REPL) where the found and required types are identical. The following example illustrates the problem. In paste mode, there is no problem:

scala> :paste
// Entering paste mode (ctrl-D to finish)


import org.apache.spark.rdd.RDD
case class C(S:String)
def f(r:RDD[C]): String = "hello"
val in = sc.parallelize(List(C("hi")))
f(in)

// Exiting paste mode, now interpreting.

import org.apache.spark.rdd.RDD
defined class C
f: (r: org.apache.spark.rdd.RDD[C])String
in: org.apache.spark.rdd.RDD[C] = ParallelCollectionRDD[0] at parallelize at <console>:13
res0: String = hello

but then, at the regular prompt:

scala> f(in)
<console>:29: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[C]
 required: org.apache.spark.rdd.RDD[C]
              f(in)
                ^

There are related discussions about the Scala REPL and about the spark-shell, but the issue mentioned there seems unrelated (and resolved) to me.

This issue causes serious problems for recording an interactive session as replayable, pasteable code, and it takes away much of the benefit of working in a REPL to begin with. Is there a solution? (And/or is this a known issue?)

Edits:

The problem occurs with Spark 1.2 and 1.3.0. Spark 1.3.0 was tested with Scala 2.10.4.

It seems that, at least in this test, re-pasting the statements that use the class, separately from the case class definition itself, avoids the problem:

scala> :paste
// Entering paste mode (ctrl-D to finish)


def f(r:RDD[C]): String = "hello"
val in = sc.parallelize(List(C("hi1")))

// Exiting paste mode, now interpreting.

f: (r: org.apache.spark.rdd.RDD[C])String
in: org.apache.spark.rdd.RDD[C] = ParallelCollectionRDD[1] at parallelize at <console>:26

scala> f(in)
res2: String = hello


1 answer


Unfortunately, this is still an open issue. Code entered in the spark-shell gets wrapped in classes, and that sometimes causes strange behavior.
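Roughly speaking, each interpreted chunk is compiled inside its own synthetic wrapper, so a class defined in one chunk is not always the same type as a class of the same name resolved through a later chunk. A minimal sketch of the effect (this is not the code the REPL actually generates, and the wrapper names are made up):

object Wrapper1 {
  case class C(s: String)           // the C that f and in were compiled against
  def f(r: List[C]): String = "hello"
  val in = List(C("hi"))
}

object Wrapper2 {
  case class C(s: String)           // same name, but a distinct type
  // Wrapper1.f(List(C("hi")))      // would not compile: found Wrapper2.C, required Wrapper1.C
}

Pasting the definition and its uses in a single :paste block keeps everything in one compilation unit, which is presumably why the first session above works.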

Another problem: errors like

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(...,...)]

can be caused by using different Spark versions in the same project. If you are using IntelliJ, go to "File -> Project Structure -> Libraries" and delete stale entries such as "SBT: org.apache.spark:spark-catalyst_2.10:1.1.0:jar". You need the libraries that match your Spark version, 1.2.0 or 1.3.0.
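If you build with sbt, one way to avoid that is to keep every Spark module on a single version. A minimal build.sbt sketch; the module list and version numbers are only examples for a Spark 1.3.0 / Scala 2.10 setup:

// build.sbt sketch: pin all Spark modules to one version so that no stale
// artifact (e.g. a leftover 1.1.0 jar) ends up on the classpath.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0",
  "org.apache.spark" %% "spark-sql"  % "1.3.0"
)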



Hope this helps you.
