How to specify the schema for Parquet data in Hive 0.13+

I have a Parquet file which I created by converting an Avro data file. The file contains complex records. I also have the Avro schema of these records, as well as the equivalent Parquet schema (I obtained it when converting the file). I want to create a Hive table backed by the Parquet file.

Since my record schema contains many fields, declaring the Hive columns manually to match those fields is tedious and error prone. So I would like Hive to derive the table columns for my Parquet-backed table from the Parquet record schema, much like AvroSerDe uses the Avro schema to define the table columns. Is this supported by ParquetSerDe? How can I do this?

PS: I am aware of a possible workaround where I could first define an Avro-backed table using the Avro schema and then use a CTAS statement to create the Parquet table. But this doesn't work if there are unions in the schema, because AvroSerDe maps them to Hive unions, which have little to no support (!!), and ParquetSerDe doesn't know how to deal with them.
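For reference, the workaround looks roughly like this (a sketch only; the table names, paths, and schema URL are made up):

-- Avro-backed table whose columns come from the Avro schema.
CREATE EXTERNAL TABLE events_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/events/avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/events.avsc');

-- Copy the column layout (and data) into a Parquet-backed table.
CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_avro;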



3 answers


I did a little digging and found the answer, so here it is for anyone else who gets stuck:



ParquetSerDe currently has no support for table definitions other than pure DDL, where you must explicitly specify each column. There is a JIRA ticket tracking the addition of support for defining a table from an existing Parquet file (HIVE-8950).
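Until that lands, the columns have to be spelled out by hand. A minimal sketch of such a DDL statement (table, column, and path names are hypothetical):

-- Explicit column list for a Parquet-backed table (Hive 0.13+ supports STORED AS PARQUET).
CREATE EXTERNAL TABLE events_parquet (
  col1 STRING,
  col2 BIGINT,
  payload STRUCT<id:STRING, score:DOUBLE>
)
STORED AS PARQUET
LOCATION '/data/events/parquet';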



We use Hive as part of the CDH distribution, which also includes Impala.

Unlike Hive, Impala already has support for schema inference from Parquet files: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_create_table.html

Note



Column definitions retrieved from data file:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE PARQUET 'hdfs_path_of_parquet_file'

This currently only works for Parquet files, not AVRO.

Because of this, we actually resort to Impala for some of our workflows (e.g. after a Sqoop import into Parquet files, or after distcp'ing from an external Hadoop cluster) - quite useful!
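For concreteness, the statement looks roughly like this (a hypothetical example; the table name and paths are made up):

-- Impala infers the column definitions from an existing Parquet file.
CREATE EXTERNAL TABLE events_parquet
LIKE PARQUET '/user/hive/warehouse/events/part-00000.parquet'
STORED AS PARQUET
LOCATION '/user/hive/warehouse/events';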



Unfortunately there is no parquet.schema.literal, analogous to avro.schema.literal, that could be used to define a table from a schema.

You will need to declare the columns individually in the table definition or use CTAS statements.

As for union schemas not working in Hive: I use union schema definitions in my avsc files for the field data types, and it works pretty well.

This is the structure of my avsc:

{"namespace": "somename",
 "type": "record",
 "name": "somename",
 "fields": [
     {"name": "col1", "type": "string"},
     {"name": "col2", "type": "string"},
     {"name": "col3", "type": ["string","null"]},
     {"name": "col4", "type": ["string", "null"]},
     {"name": "col5", "type": ["string", "null"]},
     {"name": "col6", "type": ["string", "null"]},
     {"name": "col7", "type": ["string", "null"]},
     {"name": "col8", "type": ["string", "null"]}  
 ]
}
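With a schema like this, the table is typically defined via avro.schema.literal; a minimal sketch (table name and location are made up, and the schema is abbreviated):

CREATE EXTERNAL TABLE somename
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/somename'
TBLPROPERTIES ('avro.schema.literal'='{"namespace":"somename","type":"record","name":"somename","fields":[{"name":"col1","type":"string"},{"name":"col3","type":["string","null"]}]}');

AvroSerDe treats a union of a type with null as a plain nullable column, which is why this works without involving Hive's uniontype.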

