Read data from Hadoop HDFS using the SparkSQL connector to render it in Superset?
On an Ubuntu server I have configured Divolte Collector to collect clickstream data from websites. The data is stored in Hadoop HDFS as Avro files ( http://divolte.io/ ).
I would now like to visualize the data with Airbnb Superset, which ships with connectors for many common databases (thanks to SQLAlchemy), but not for HDFS.
Superset does, in particular, have a connector for SparkSQL via its Hive interface ( http://airbnb.io/superset/installation.html#database-dependencies ).
So can that connector be used to fetch the data stored in HDFS? Thanks!
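For reference, Superset's SparkSQL connector is registered under a `hive://` SQLAlchemy URI. Below is a minimal sketch of building such a URI; the host name, port, database and user are hypothetical placeholders, and a Spark Thrift Server would need to be listening at that address for Superset to connect:

```python
# Sketch: build the SQLAlchemy URI that Superset expects for its
# Hive/SparkSQL connector (backed by PyHive). Host, port, database
# and user below are placeholders for illustration.
def spark_sql_uri(host, port=10000, database="default", user=None):
    """Return a 'hive://' SQLAlchemy URI for Superset's SparkSQL connector."""
    auth = f"{user}@" if user else ""
    return f"hive://{auth}{host}:{port}/{database}"

# Hypothetical Thrift Server endpoint:
print(spark_sql_uri("spark-thrift.example.com"))
# → hive://spark-thrift.example.com:10000/default
```

This string is what you would paste into Superset's "SQLAlchemy URI" field when adding the database.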
To read HDFS data with SparkSQL, there are two main ways to configure it:
- Read a table already defined in Hive (i.e. connect to a remote Hive metastore) (probably not your case)
- Create a table over the Avro files directly with the spark-avro package
By default (unless configured otherwise), SparkSQL creates an embedded Hive metastore, which lets you issue DDL and DML statements using Hive syntax. To read Avro you will need the external package com.databricks:spark-avro:

CREATE TEMPORARY TABLE divolte_data
USING com.databricks.spark.avro
OPTIONS (path "path/to/divolte/avro");
The data should now be available inside the table divolte_data
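Putting the steps above together, here is a minimal sketch that assembles the spark-shell invocation and the DDL statement from the answer. The spark-avro package version and the HDFS path are assumptions for illustration, not values from the original post:

```python
# Sketch: assemble the launch command and the Hive-syntax DDL described
# in the answer. Package version and Avro path are assumed placeholders.
AVRO_PACKAGE = "com.databricks:spark-avro_2.11:4.0.0"  # assumed version

def launch_command(package=AVRO_PACKAGE):
    """spark-shell command line that pulls in the spark-avro package."""
    return f"spark-shell --packages {package}"

def create_table_ddl(table, avro_path):
    """DDL mapping a temporary SparkSQL table onto Avro files in HDFS."""
    return (
        f'CREATE TEMPORARY TABLE {table} '
        f'USING com.databricks.spark.avro '
        f'OPTIONS (path "{avro_path}")'
    )

print(launch_command())
print(create_table_ddl("divolte_data", "hdfs:///divolte/avro"))
```

Once the temporary table exists in the SparkSQL session, `SELECT` queries against `divolte_data` read the Avro files straight from HDFS.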