Best practice for creating a SparkSession object in Scala for use in both unit tests and spark-submit

I wrote a conversion method from DataFrame to DataFrame, and I also want to test it with ScalaTest.

As you know, in Spark 2.x with the Scala API, you can create a SparkSession object like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
     .config("spark.master", "local[2]")
     .getOrCreate()

This code works fine in unit tests. But when I run it with spark-submit, the cluster parameters do not take effect. For example,

spark-submit --master yarn --deploy-mode client --num-executors 10 ...

      

does not create any executors.

I found that the spark-submit arguments are applied when I remove config("spark.master", "local[2]") from the code above. But without setting the master, the unit-test code did not work.
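
For illustration, removing the master from the builder looks like this (the app name is a placeholder); the spark-submit flags then take effect, but a plain test run fails at getOrCreate() because no master URL is set:

import org.apache.spark.sql.SparkSession

// No hard-coded master here, so spark-submit's --master and --num-executors apply.
// Run outside spark-submit, getOrCreate() fails because no master URL is configured.
val spark = SparkSession.builder()
  .appName("MyApp")
  .getOrCreate()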

I tried to separate the part that creates the spark object (SparkSession) for testing and for main. But so many blocks of code require a spark value in scope, such as import spark.implicits._ and spark.createDataFrame(rdd, schema).
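
For illustration, a hypothetical conversion method then has to take the session along, roughly like this (myConversion and the filter are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical conversion: the session must be passed in (or otherwise be in scope)
// because the implicits and createDataFrame hang off it.
def myConversion(input: DataFrame)(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._   // provides $"..." column syntax and encoders
  input.filter($"id" > 0)    // made-up transformation
}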

Is there any best practice for writing code to create a spark object both for testing and running spark-submit?

+3


3 answers


One way is to create a trait that provides the SparkContext / SparkSession and mix it into your test cases, for example:

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

trait SparkTestContext {
  private val master = "local[*]"
  private val appName = "testing"
  // Only needed on Windows, where Spark looks for winutils.exe
  System.setProperty("hadoop.home.dir", "c:\\winutils\\")
  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  // enableHiveSupport() requires the spark-hive dependency on the test classpath
  val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}

      



And your test class header looks like this:

class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers {
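
Filling that header out into a complete hypothetical test (assuming BaseSpec is a FlatSpec-style base class; the data and assertion are made up):

class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers {
  import ss.implicits._   // ss comes from SparkTestContext; gives .toDF on local Seqs

  it should "build a DataFrame from the shared session" in {
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")   // sample data
    df.count() shouldBe 2
  }
}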

+2


You could define an object with a method that creates a singleton instance of SparkSession, for example MySparkSession.get(), and pass it as a parameter to each of your unit tests.



In the main method, you can create a separate SparkSession instance that can have different configurations.
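
A minimal sketch of that idea (MySparkSession.get() is the name from this answer; app names and the local master are placeholders):

import org.apache.spark.sql.SparkSession

object MySparkSession {
  // Returns a single shared session with a local master, suitable for tests.
  // getOrCreate() reuses the existing session if one is already running.
  def get(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("testing")
      .getOrCreate()
}

object Main {
  def main(args: Array[String]): Unit = {
    // No master set here, so spark-submit's --master / --num-executors apply.
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()
    // ... run the job with `spark` ...
    spark.stop()
  }
}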

0


I made a version where Spark is shut down correctly after the tests.

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
  var ss: SparkSession = _
  var sc: SparkContext = _
  var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    val master = "local[*]"
    val appName = "MyApp"
    val conf: SparkConf = new SparkConf()
      .setMaster(master)
      .setAppName(appName)
      .set("spark.driver.allowMultipleContexts", "false")
      .set("spark.ui.enabled", "false")

    ss = SparkSession.builder().config(conf).getOrCreate()

    sc = ss.sparkContext
    sqlContext = ss.sqlContext
    super.beforeAll()
  }

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
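
For usage, a suite only needs to mix the trait in (the suite name, sample data, and assertion are placeholders):

class MyConversionSuite extends SparkTest {
  test("counts the rows of a small DataFrame") {
    val spark = ss              // local val so implicits can be imported from it
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")   // sample data
    df.count() shouldBe 2
  }
}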

      

0

