How can I write a unit test in Spark that creates a basic DataFrame?

I am trying to write a basic unit test to create a dataframe using the sample text file that comes with Spark as follows.

class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  private var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).getOrCreate()
  }

  import spark.implicits._

  case class Person(name: String, age: Int)

  val df = spark.sparkContext
    .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()

  test("Creating dataframe should produce data frame of correct size") {
    assert(df.count() == 3)
    assert(df.take(1).equals(Array("Michael", 29)))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}

I know the code itself works (from import spark.implicits._ through .toDF()) because I tested it in the Spark Scala shell, but inside the test class I am getting a lot of errors: the IDE does not recognize import spark.implicits._ or .toDF(), so the tests will not run.

I am using SparkSession, which automatically creates SparkConf, SparkContext and SQLContext under the hood.
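For example (just to illustrate what I mean, this snippet is not part of the failing test), all of them are reachable from the session itself:

import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
val sc: SparkContext = spark.sparkContext           // the underlying SparkContext
val sqlContext: SQLContext = spark.sqlContext       // SQLContext, kept for backwards compatibility
val conf = spark.sparkContext.getConf               // the SparkConf the session was built with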

My code just uses the example code from the Spark repo.

Any ideas why this isn't working? Thanks!

NB: I've already looked at Spark unit-testing questions on Stack Overflow, such as "How do I write unit tests in Spark 2.0+?". I used that to write this test, but I am still getting errors.

I am using Scala 2.11.8 and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included in the SBT assembly file. Errors when running tests:

Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person] possible cause: maybe a semicolon is missing before `value toDF'? .toDF()

Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found. import spark.implicits._

IntelliJ will not recognize import spark.implicits._ or the .toDF() method.

I imported:

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
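For reference, the relevant part of my build.sbt looks roughly like this (the Spark and Scala versions are the ones listed above; the ScalaTest version is only indicative):

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Provided because the fat jar is built with sbt-assembly
  "org.apache.spark" %% "spark-core" % "2.2.0" % Provided,
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % Provided,
  "org.scalatest"    %% "scalatest"  % "3.0.4" % Test
)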



2 answers


You need to assign the sqlContext to a val in order to use its implicits. Since your sparkSession is a var, the implicits import won't work with it (the import requires a stable identifier). So you need to do:

val sQLContext = spark.sqlContext
import sQLContext.implicits._
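Equivalently (this variant is not in the answer above, just a sketch of the same idea): since the only requirement is a stable identifier, you can copy the var into a local val inside the test body and import that session's implicits directly:

// Inside the test, after beforeEach has initialised spark:
val stableSpark = spark          // a val is a stable identifier, so the import compiles
import stableSpark.implicits._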

      



Also, you can move the DataFrame creation into the test itself, so that your test class looks like this:

class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("Creating dataframe should produce data frame of correct size") {
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach() {
    spark.stop()
  }
}
case class Person(name: String, age: Int)
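Note two changes from the original code: .master(master) is now passed to the builder (without a master URL the session cannot start when the test runs locally), and the DataFrame is built inside the test, after beforeEach has initialised spark.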

      



There are many libraries for unit testing Spark; one of the most widely used is spark-testing-base by Holden Karau.

This library provides the SparkContext for you as sc, as in the simple example below:

class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3),("b", 2),("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}

      

Here sc is already set up for you as the SparkContext, ready to use in your tests.
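The WordCount class used above is not shown in the answer; a minimal sketch of what it might look like, assuming it just splits lines on whitespace and counts words (the sort is only there so that collect() returns a predictable order), is:

import org.apache.spark.rdd.RDD

class WordCount extends Serializable {
  // Split each line on whitespace, count every word, and sort by word so
  // that collect() returns the pairs in a deterministic order.
  def transform(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortByKey()
}

spark-testing-base is published under the com.holdenkarau group id; the artifact version couples a Spark version with a library version, so check the project README for the one that matches your Spark build.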



Another approach is to create a TestSparkWrapper trait and reuse it across multiple test cases, like below:

import org.apache.spark.sql.SparkSession

trait TestSparkWrapper {

  lazy val sparkSession: SparkSession = 
    SparkSession.builder().master("local").appName("spark test example ").getOrCreate()

}

      

Use this TestSparkWrapper in all of your tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach as needed.
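A sketch of what that could look like (the class name and test are made up; the data file is the same people.txt from the question):

import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

class PeopleLoadTest extends FunSuite with Matchers with BeforeAndAfterAll with TestSparkWrapper {

  test("people.txt should contain 3 people") {
    // sparkSession is a lazy val in the trait, i.e. a stable identifier,
    // so importing its implicits compiles without the sqlContext workaround.
    import sparkSession.implicits._

    val df = sparkSession.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => (attributes(0), attributes(1).trim.toInt))
      .toDF("name", "age")

    df.count() shouldBe 3
  }

  // BeforeAndAfterAll: stop the shared session once, after all tests in the class.
  override def afterAll(): Unit = {
    sparkSession.stop()
  }
}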

Hope this helps!
