Parquet: reading individual columns into memory
If you can use Hive, creating a Hive table over the file and issuing a simple SELECT query would be the easiest option:
create external table tbl1 (<columns>) stored as parquet location '<file_path>';
select col1, col2 from tbl1;
-- this works in Hive 0.14
You can also run these queries from a Java program through the Hive JDBC driver.
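A minimal sketch of the JDBC route, assuming a HiveServer2 instance is reachable; the URL, user, and table name are placeholders for your cluster:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSelect {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver class (hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder connection URL; adjust host, port, database, credentials.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Only the requested columns are read from the Parquet file.
             ResultSet rs = stmt.executeQuery("select col1, col2 from tbl1")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```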
Otherwise, if you want to stay entirely in Java, you need to modify the Avro schema to exclude all fields except the ones you want to extract. Then, when you read the file, supply the modified schema as the reader schema, and only the included columns will be read. Note that you will still get Avro records (with the excluded fields missing), not a plain 2D array of values.
To build the modified schema, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure the modified schema is compatible with the original one.
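A sketch of this projection approach using parquet-avro, assuming a record named "MyRecord" with string column col1 and long column col2; the record name must match the one in the file's original schema, and all names here are illustrative:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetColumnProjection {
    public static void main(String[] args) throws Exception {
        // Projection schema: same record name as the full schema, but
        // containing only the fields we want to read.
        Schema projection = SchemaBuilder.record("MyRecord")
                .fields()
                .optionalString("col1")
                .optionalLong("col2")
                .endRecord();

        Configuration conf = new Configuration();
        // Ask the Avro read support to read only the projected columns.
        AvroReadSupport.setRequestedProjection(conf, projection);

        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(args[0]))
                         .withConf(conf)
                         .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                // Each record exposes only the projected fields.
                System.out.println(record.get("col1") + "\t" + record.get("col2"));
            }
        }
    }
}
```

Because Parquet is columnar, the excluded columns are skipped on disk, not read and discarded.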