Read data from Hadoop HDFS using the SparkSQL connector to render it in Superset?

On an Ubuntu server I have configured Divolte Collector to collect data from websites. The data is stored in Hadoop HDFS as Avro files (http://divolte.io/).

I would then like to render the data with Airbnb Superset, which has connectors for many common databases (thanks to SQLAlchemy), but not for HDFS.

Superset does, however, have a connector for SparkSQL via the Hive JDBC interface (http://airbnb.io/superset/installation.html#database-dependencies).

So can this connector be used to fetch the data stored on HDFS? Thanks.





1 answer


To read HDFS data in SparkSQL, there are two main ways to set it up:

  • Read a table that is already defined in Hive (SparkSQL reads it from the remote Hive metastore). This is probably not your case.
  • By default (unless configured otherwise), SparkSQL creates its own embedded Hive metastore, which lets you issue DDL and DML statements using Hive syntax. To read Avro files you will also need the external package com.databricks:spark-avro; see the launch command after the snippet below.

    -- Register Divolte's Avro output as a temporary table
    -- ("path/to/divolte/avro" is a placeholder for your HDFS directory).
    CREATE TEMPORARY TABLE divolte_data
    USING com.databricks.spark.avro
    OPTIONS (path "path/to/divolte/avro");
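The package is not bundled with Spark, so it has to be supplied at startup. A minimal sketch, assuming Spark 2.x built against Scala 2.11 (adjust the artifact version to your installation); starting the Thrift server this way lets Superset connect over JDBC:

    ./sbin/start-thriftserver.sh --packages com.databricks:spark-avro_2.11:4.0.0

The same --packages flag works with spark-sql or spark-shell if you first want to test the table definition interactively.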



The data should now be available in the table divolte_data.
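In Superset you can then register a database pointing at the Spark Thrift server. Assuming PyHive is installed and the Thrift server listens on its default port 10000 (the hostname and database name below are placeholders), the SQLAlchemy URI would look something like:

    hive://localhost:10000/default

From there the table can be queried like any other, for example:

    SELECT *
    FROM divolte_data
    LIMIT 10;

One caveat: a TEMPORARY table is only visible in the session that created it, so for Superset to see it you may need a permanent table instead, or a Thrift server running in single-session mode.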


