Read data from Hadoop HDFS using the SparkSQL connector to render it in Superset?
On an Ubuntu server I have configured Divolte Collector to collect clickstream data from websites. The data is stored in Hadoop HDFS as Avro files ( http://divolte.io/ ).
I would now like to visualize the data with Airbnb Superset, which ships with connectors for many common databases (thanks to SQLAlchemy), but not for HDFS.
Superset does, in particular, have a connector for SparkSQL via its Hive interface ( http://airbnb.io/superset/installation.html#database-dependencies ).
So can that connector be used to fetch the data stored in HDFS? Thanks!
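For reference, Superset's SparkSQL connector is registered under a `hive://` SQLAlchemy URI. Below is a minimal sketch of building such a URI; the host name, port, database and user are hypothetical placeholders, and a Spark Thrift Server would need to be listening at that address for Superset to connect:

```python
# Sketch: build the SQLAlchemy URI that Superset expects for its
# Hive/SparkSQL connector (backed by PyHive). Host, port, database
# and user below are placeholders for illustration.
def spark_sql_uri(host, port=10000, database="default", user=None):
    """Return a 'hive://' SQLAlchemy URI for Superset's SparkSQL connector."""
    auth = f"{user}@" if user else ""
    return f"hive://{auth}{host}:{port}/{database}"

# Hypothetical Thrift Server endpoint:
print(spark_sql_uri("spark-thrift.example.com"))
# → hive://spark-thrift.example.com:10000/default
```

This string is what you would paste into Superset's "SQLAlchemy URI" field when adding the database.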
To read HDFS data with SparkSQL, there are two main ways to configure it:
- Read a table already defined in Hive (i.e. connect to a remote Hive metastore) (probably not your case)
- Create a table over the Avro files directly with the spark-avro package
By default (unless configured otherwise), SparkSQL creates an embedded Hive metastore, which lets you issue DDL and DML statements using Hive syntax. To read Avro you will need the external package com.databricks:spark-avro:

CREATE TEMPORARY TABLE divolte_data
USING com.databricks.spark.avro
OPTIONS (path "path/to/divolte/avro");
The data should now be available inside the table divolte_data
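Putting the steps above together, here is a minimal sketch that assembles the spark-shell invocation and the DDL statement from the answer. The spark-avro package version and the HDFS path are assumptions for illustration, not values from the original post:

```python
# Sketch: assemble the launch command and the Hive-syntax DDL described
# in the answer. Package version and Avro path are assumed placeholders.
AVRO_PACKAGE = "com.databricks:spark-avro_2.11:4.0.0"  # assumed version

def launch_command(package=AVRO_PACKAGE):
    """spark-shell command line that pulls in the spark-avro package."""
    return f"spark-shell --packages {package}"

def create_table_ddl(table, avro_path):
    """DDL mapping a temporary SparkSQL table onto Avro files in HDFS."""
    return (
        f'CREATE TEMPORARY TABLE {table} '
        f'USING com.databricks.spark.avro '
        f'OPTIONS (path "{avro_path}")'
    )

print(launch_command())
print(create_table_ddl("divolte_data", "hdfs:///divolte/avro"))
```

Once the temporary table exists in the SparkSQL session, `SELECT` queries against `divolte_data` read the Avro files straight from HDFS.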