Creating DataFrame in Spark Stream

Question

Creating DataFrame in Spark Stream

I connected the Kafka stream to the Sparks. Just like I prepared Apache Spark Mlib model for streaming text based prediction. My problem is to get the prediction that I need to pass to the DataFramework.

//kafka stream    
val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )
//load mlib model
val model = PipelineModel.load(modelPath)
 stream.foreachRDD { rdd =>

      rdd.foreach { record =>
       //to get a prediction need to pass DF
       val toPredict = spark.createDataFrame(Seq(
          (1L, record.value())
        )).toDF("id", "review")
        val prediction = model.transform(test)
      }
}

My problem is Spark streaming doesn't allow DataFrame to be created. Is there a way to do this? Can I use a case class or structure?

+1

sparse-matrix apache-spark apache-spark-mllib apache-kafka spark-streaming

Damith ganegoda 10 jul. 17 at 5:35 am

source to share

1 answer

maasg · Accepted Answer · 2017-07-10T08:59:51+0000

It is possible to create DataFrame

or Dataset

from an RDD, as in the core of Spark. To do this, we need to apply a schema. Internally, foreachRDD

we can then transform the resulting RDD into a DataFrame, which can then be used with the ML pipeline.

// we use a schema in the form of a case class
case class MyStructure(field:type, ....)
// and we implement our custom transformation from string to our structure
object MyStructure {
    def parse(str: String) : Option[MyStructure] = ...
}

val stream = KafkaUtils.createDirectStream... 
// give the stream a schema using a case class
val strucStream =  stream.flatMap(cr => MyStructure.parse(cr.value))

strucStream.foreachRDD { rdd =>
    import sparkSession.implicits._
    val df = rdd.toDF()
    val prediction = model.transform(df)
    // do something with df
}

Creating DataFrame in Spark Stream

More articles: