Parquet: reading individual columns into memory

I exported a MySQL table to a Parquet file (Avro-based). Now I want to read specific columns from that file. How can I read individual columns completely? I am looking for Java code examples.

Is there an API where I can pass the columns I need and get back a 2D array of the table?

+3




3 answers


If you can use Hive, creating an external Hive table over the file and issuing a simple SELECT query would be the easiest option.

    create external table tbl1 (<columns>)
    stored as parquet
    location '<file_path>';

    select col1, col2 from tbl1;  -- works in Hive 0.14

You can run that query from a Java program through the Hive JDBC driver.
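A minimal sketch of the JDBC route, assuming a HiveServer2 instance on localhost:10000 and the tbl1 table from the DDL above; the host, port, and credentials are placeholders for your setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveColumnReader {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver (hive-jdbc on the classpath)
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Only the projected columns travel back to the client
                 ResultSet rs = stmt.executeQuery("select col1, col2 from tbl1")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2));
                }
            }
        }
    }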



Otherwise, if you want to stay entirely in Java, you need to derive a projection of the Avro schema that excludes all fields except the ones you want to extract. Then, when you read the file, supply the projected schema as the reader schema, and only the included columns will be read. Note that you will get Avro records with the other fields omitted, not a 2D array.

To build the projection, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder, and make sure the projected schema is compatible with the original schema.
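A rough sketch of this approach using the parquet-avro module, assuming two columns col1 (string) and col2 (int); the record name, field names, and types are placeholders and must match your file's actual schema:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroReadSupport;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ParquetProjectionReader {
        public static void main(String[] args) throws Exception {
            // Projection schema: keep only col1 and col2. The record name
            // (and namespace, if any) should match the writer schema so the
            // two schemas resolve as compatible.
            Schema projection = SchemaBuilder.record("tbl1").fields()
                    .optionalString("col1")
                    .optionalInt("col2")
                    .endRecord();

            Configuration conf = new Configuration();
            // Ask the Avro read support to materialize only these columns
            AvroReadSupport.setRequestedProjection(conf, projection);

            try (ParquetReader<GenericRecord> reader =
                    AvroParquetReader.<GenericRecord>builder(new Path("/path/to/file.parquet"))
                            .withConf(conf)
                            .build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record.get("col1") + "\t" + record.get("col2"));
                }
            }
        }
    }

Because Parquet is columnar, the projection is pushed down to the reader, so only the requested column chunks are decoded from disk.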

+1




Options:



  • Create a Hive table over the file with all columns, using Parquet as the storage format, and read the required columns by specifying the column names in the query
  • Create a Thrift definition for the table and use the Thrift-generated class to read the data from code (Java or Scala), as in the sketch after this list
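A rough sketch of the Thrift route using the parquet-thrift module; Tbl1Record is a hypothetical class generated by the Thrift compiler from a struct mirroring the table, and the builder API may differ slightly across parquet-mr versions:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.thrift.ThriftParquetReader;

    public class ThriftParquetRead {
        public static void main(String[] args) throws Exception {
            // Tbl1Record is the (hypothetical) Thrift-generated class
            try (ParquetReader<Tbl1Record> reader =
                    ThriftParquetReader.<Tbl1Record>build(new Path("/path/to/file.parquet"))
                            .withThriftClass(Tbl1Record.class)
                            .build()) {
                Tbl1Record record;
                while ((record = reader.read()) != null) {
                    // Touch only the fields you need
                    System.out.println(record.getCol1() + "\t" + record.getCol2());
                }
            }
        }
    }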
0




You can also use Apache Drill, which reads Parquet files natively.
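A sketch of querying the file through Drill's JDBC driver, assuming an embedded/local Drillbit ("zk=local"); the file path and column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillParquetQuery {
        public static void main(String[] args) throws Exception {
            // Drill JDBC driver; adjust the connection string for a real cluster
            Class.forName("org.apache.drill.jdbc.Driver");
            try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
                 Statement stmt = conn.createStatement();
                 // The dfs storage plugin lets Drill query the file in place
                 ResultSet rs = stmt.executeQuery(
                         "SELECT col1, col2 FROM dfs.`/path/to/file.parquet`")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2));
                }
            }
        }
    }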

0








