Set Parquet file size - Hive?

Question

I am trying to split the Parquet/Snappy files generated by Hive's INSERT OVERWRITE TABLE ... at the dfs.block.size boundary, because Impala issues a warning when a file in a partition is larger than the block size.

Impala logs the following warning:

Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar)

Code:

CREATE TABLE <TABLE_NAME>(<FIELDS>)
PARTITIONED BY (
    year SMALLINT,
    month TINYINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY");

As for the INSERT

hql script:

SET dfs.block.size=134217728;
SET hive.exec.reducers.bytes.per.reducer=134217728;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=134217728;
SET hive.merge.smallfiles.avgsize=67108864;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=134217728;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;

The problem is that the file sizes are all over the place:

partition 1: 1 file:  size  = 163.9 M
partition 2: 2 files: sizes = 207.4 M, 128.0 M
partition 3: 3 files: sizes = 166.3 M, 153.5 M, 162.6 M
partition 4: 3 files: sizes = 151.4 M, 150.7 M, 45.2 M

The problem is the same regardless of whether dfs.block.size (and the other settings above) is increased to 256M, 512M or 1G (for different data sets).

Is there a way or setting to make sure that the output Parquet/Snappy files are split just below the HDFS block size?

hive parquet snappy impala

Hatim diab · June 15, 2015 at 15:13

3 answers

blue · Answer 1 · 2015-11-13T23:25:19+0000

There is no way to close a file once it grows to the size of one HDFS block and start a new one. That would be at odds with how HDFS works: it is designed for files that span many blocks.

The right solution is for Impala to schedule its tasks where the blocks are local, rather than complaining about files that span more than one block. This was completed recently as IMPALA-1881 and will be released in Impala 2.3.

Stamperious · Answer 2 · 2015-06-17T03:46:47+0000

You need to set both the Parquet block size and the DFS block size:

SET dfs.block.size=134217728;  
SET parquet.block.size=134217728;

Both should be set to the same value, because you want each Parquet block to fit inside a single HDFS block.
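
As a rough sketch of how this might look in the question's script, here are the two settings at 256 MB (268435456 bytes), one of the sizes the question mentions trying. The 256 MB value is only an illustration, and whether Hive passes parquet.block.size through to the Parquet writer may depend on the Hive and Parquet versions in use, so treat this as something to test rather than a guaranteed fix:

```sql
-- Assumption: 256 MB chosen only for illustration (256 * 1024 * 1024 bytes).
SET dfs.block.size=268435456;
-- Keep the Parquet block (row group) size equal to the HDFS block size.
SET parquet.block.size=268435456;

-- Same INSERT as in the question; placeholders are left as-is.
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
FROM <ANOTHER_TABLE> WHERE year=<YEAR> AND month=<MONTH>;
```

Note that, per the other answers, this does not cap the size of each output file; it only aims to keep individual Parquet blocks from straddling HDFS block boundaries.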

Tagar · Answer 3 · 2015-07-24T01:35:49+0000

In some cases, you can set the Parquet block size through mapred.max.split.size (Parquet 1.4.2+), which you have already done. You can set it below the HDFS block size to increase parallelism. Parquet tries to align to HDFS blocks if possible:

https://github.com/Parquet/parquet-mr/pull/365

Edit 11/16/2015: According to https://github.com/Parquet/parquet-mr/pull/365#issuecomment-157108975 this could also be IMPALA-1881 which is fixed in Impala 2.3.
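
A minimal sketch of the "below the HDFS block size" suggestion, assuming the 128 MB block size from the question. The 64 MB split size is a hypothetical value picked for illustration, and the effect on Parquet output depends on the Parquet version (1.4.2+ as noted above):

```sql
-- HDFS block size of 128 MB, as in the question's script.
SET dfs.block.size=134217728;
-- Hypothetical choice: 64 MB maximum split size, half the HDFS block size,
-- trading smaller splits for more parallelism.
SET mapred.max.split.size=67108864;
```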
