Zeppelin: convert pyspark.rdd.RDD to dataframe (dataframe pyspark)

Question

I am trying to convert a pyspark.rdd.RDD to a DataFrame. I have already done this in Spark, but the same approach does not work in Zeppelin.

This is how I used to convert the pyspark.rdd.RDD:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import pandas as pd


# Comment this line out if a SparkContext has already been created
sc = SparkContext()

conf = {"es.resource" : "index/type", "es.nodes" : "ES_Serveur", "es.port" : "9200", "es.query" : "?q=*"}
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat","org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)

# Creating a SparkSession makes the toDF method available on RDDs
spark = SparkSession(sc)

df = rdd.toDF().toPandas()

This works when run through spark-submit, but not in Zeppelin.

I'm wondering why.

There are errors in the logs, but the output is over 1000 lines long. If it helps, I can provide excerpts from the logs.

If anyone has an idea, thanks.

python pyspark apache-zeppelin

fjcf1 Apr 21 '17 at 13:00

1 answer

fjcf1 · Accepted Answer · 2017-04-24T09:00:59+0000

I found a solution: in the Spark interpreter configuration (in Zeppelin), set the zeppelin.spark.useHiveContext property to false. However, I still don't understand why the problem shows up on the line that calls the toDF method.
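
A related sketch, not part of the original fix: in PySpark, toDF is a shortcut that gets attached to RDDs when a SparkSession (or SQLContext) is created, and it delegates to that object's createDataFrame. Assuming the Zeppelin %pyspark paragraph already provides sc and spark (or sqlContext on Spark 1.x), the conversion can be written against that existing session instead of building a new SparkContext and SparkSession. The helper name rdd_to_pandas is hypothetical, and this has not been verified against the setup in the question.

```python
def rdd_to_pandas(session, rdd):
    """Convert an RDD to a pandas DataFrame through an existing session.

    `session` is assumed to be the object Zeppelin already provides:
    `spark` on Spark 2.x, or `sqlContext` on Spark 1.x. Both expose
    createDataFrame, which is what rdd.toDF() calls internally.
    """
    # Explicit equivalent of rdd.toDF(), bound to the given session
    spark_df = session.createDataFrame(rdd)
    # Collects the whole dataset to the driver, same as in the question
    return spark_df.toPandas()
```

In a Zeppelin paragraph this would be called as df = rdd_to_pandas(spark, rdd), with rdd built from the existing sc exactly as in the question, and without the sc = SparkContext() and spark = SparkSession(sc) lines.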
