How can I write unit tests in Spark for a basic dataframe creation example?
I am trying to write a basic unit test to create a dataframe using the sample text file that comes with Spark as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  private var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).getOrCreate()
  }

  import spark.implicits._

  case class Person(name: String, age: Int)

  val df = spark.sparkContext
    .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()

  test("Creating dataframe should produce dataframe of correct size") {
    assert(df.count() == 3)
    assert(df.take(1).equals(Array("Michael", 29)))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}
I know the code itself works (from import spark.implicits._ through .toDF()) because I tested it in the Spark Scala shell, but inside the test class I am getting a lot of errors; the IDE does not recognize import spark.implicits._ or .toDF(), so the tests will not run.
I am using SparkSession, which automatically creates SparkConf, SparkContext and SQLContext under the hood.
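This is how I understand the handles it exposes (just a sketch to show what I mean, not my test code; master and appName are placeholders):

  val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
  val sc = spark.sparkContext            // the SparkContext it created
  val sqlContext = spark.sqlContext      // the SQLContext it created
  val conf = spark.sparkContext.getConf  // the underlying SparkConf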
My code just uses the example code from the Spark repo.
Any ideas why this isn't working? Thanks!
NB. I've already looked at Spark unit test questions on StackOverflow like this: How do I write unit tests in Spark 2.0+? I used this to write a test, but I am still getting errors.
I am using Scala 2.11.8 and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included in the SBT assembly file. Errors when running tests:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person] possible cause: maybe a semicolon is missing before `value toDF'? .toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found. import spark.implicits._
IntelliJ will not recognize import spark.implicits._ or the .toDF() method.
I imported: import org.apache.spark.sql.SparkSession and import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign the sqlContext to a val for the implicits to work. Since your sparkSession is a var, its implicits won't work with it. So you need to do:

  val sQLContext = spark.sqlContext
  import sQLContext.implicits._
Also, you can move the dataframe creation into the test functions themselves, so that your test class looks like this:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("Creating dataframe should produce dataframe of correct size") {
    // Assign the SQLContext to a val so that its implicits can be imported.
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach() {
    spark.stop()
  }
}
// Defined outside the test class so that toDF can derive an Encoder for it.
case class Person(name: String, age: Int)
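You can then run the suite with sbt test, or from IntelliJ's ScalaTest runner.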
There are many libraries for unit testing Spark; one of the most widely used is spark-testing-base by Holden Karau. This library gives you a ready-made SparkContext as sc, as in the simple example below:
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3), ("b", 2), ("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  // WordCount is assumed to be the class under test, exposing a transform(RDD[String]) method.
  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}
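To try spark-testing-base, add the test dependency to your build.sbt. A rough sketch (the exact version string below is an assumption; check the project's README for the build matching your Spark and Scala versions):

  libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"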
Here sc is ready to use as a SparkContext.
Another approach is to create a TestWrapper and reuse it across multiple test cases, like below:
import org.apache.spark.sql.SparkSession

trait TestSparkWrapper {
  lazy val sparkSession: SparkSession =
    SparkSession.builder().master("local").appName("spark test example").getOrCreate()
}
Then use this TestSparkWrapper for all tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach.
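A minimal sketch of how such a suite could look (the class name PeopleTest and the inlined rows are made up for illustration):

  import org.scalatest.{BeforeAndAfterAll, FunSuite}

  class PeopleTest extends FunSuite with BeforeAndAfterAll with TestSparkWrapper {

    test("people data should contain 3 rows") {
      // sparkSession is a lazy val in the trait, so importing its implicits compiles fine.
      import sparkSession.implicits._
      val df = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
      assert(df.count() == 3)
    }

    override def afterAll(): Unit = {
      sparkSession.stop()
    }
  }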
Hope this helps!