Writing data from a DStream to Parquet
After consuming data from Kinesis with PySpark, I have a DStream with items like this:
('filename_1', [{'name': 'test'}, {'name': 'more'}, {'name': 'other'}])
('filename_2', [{'age': 15}, {'age': 25}])
Now I want to write the second part of the tuple to the location identified by the first part of the tuple.
Elsewhere I have done this by converting each list of dictionaries to a DataFrame using:
dataframe = sqlContext.createDataFrame(list_of_dicts)
and writing it down with something like:
dataframe.write.parquet('filename')
My problem is how to turn each row of the DStream into a DataFrame. My intuition was to map over the DStream and convert each element, but that requires the sqlContext inside the map function, which Spark does not allow and which fails with this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
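For reference, the failing attempt looks roughly like this (a sketch; the stream and function names are illustrative):

def to_dataframe(item):
    filename, list_of_dicts = item
    # sqlContext lives on the driver; referencing it here makes Spark
    # try to ship it to the workers, which triggers SPARK-5063
    dataframe = sqlContext.createDataFrame(list_of_dicts)
    dataframe.write.parquet(filename)

stream.map(to_dataframe)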
I'm not tied to Parquet, but I need some kind of schema (hence the detour through a DataFrame). Is there a way to do this with Spark?
You can create (or look up) an SQLContext instance inside the foreachRDD call:
words.foreachRDD(
  new Function2<JavaRDD<String>, Time, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd, Time time) {
      // Obtained on the driver for each batch, so SPARK-5063 does not apply
      SQLContext sqlContext = JavaSQLContextSingleton.getInstance(rdd.context());
      // ... build a DataFrame from rdd and write it to Parquet here
      return null;
    }
});
You can find more details in the Spark Streaming programming guide (the section on DataFrame and SQL operations).
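In PySpark, a minimal sketch of the same idea might look like this (assuming the DStream is named stream and carries (filename, list_of_dicts) tuples as in the question; getSqlContextInstance follows the lazily-instantiated-singleton pattern, and the function names are illustrative):

from pyspark.sql import SQLContext

def getSqlContextInstance(sparkContext):
    # Lazily create one SQLContext and reuse it across batches
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def write_rdd(rdd):
    # foreachRDD runs this function on the driver, so the SQLContext
    # is never shipped to the workers and SPARK-5063 does not apply
    sqlContext = getSqlContextInstance(rdd.context)
    for filename, list_of_dicts in rdd.collect():
        dataframe = sqlContext.createDataFrame(list_of_dicts)
        # append so that successive batches don't overwrite each other
        dataframe.write.mode('append').parquet(filename)

stream.foreachRDD(write_rdd)

Note that rdd.collect() pulls each batch to the driver, which is fine for small batches; for larger data you would keep the records distributed and build the DataFrames another way.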