Hive output file is greater than the dfs block size limit
I have a table test
that was created in Hive. It is partitioned by idate
, and I often need to insert data into individual partitions. This can leave files on HDFS that are only a few lines long.
hadoop fs -ls /db/test/idate=1989-04-01
Found 3 items
-rwxrwxrwx 3 deployer supergroup 710 2015-04-26 11:33 /db/test/idate=1989-04-01/000000_0
-rwxrwxrwx 3 deployer supergroup 710 2015-04-26 11:33 /db/test/idate=1989-04-01/000001_0
-rwxrwxrwx 3 deployer supergroup 710 2015-04-26 11:33 /db/test/idate=1989-04-01/000002_0
I'm trying to put together a simple script to combine these files to avoid a lot of small files on my partitions:
insert overwrite table test partition (idate)
select * from test
where idate = '1989-04-01'
distribute by idate
This works: it creates a new file containing all the rows from the old ones. The problem is that when I run this script on large partitions, the output is still a single file:
hadoop fs -ls /db/test/idate=2015-04-25
Found 1 items
-rwxrwxrwx 3 deployer supergroup 1400739967 2015-04-27 10:53 /db/test/idate=2015-04-25/000001_0
This file is larger than 1 GB, but the block size is set to 128 MB:
hive> set dfs.blocksize;
dfs.blocksize=134217728
I could manually set the number of reducers to keep the output files small, but shouldn't they be split automatically? Why does Hive create files larger than the configured block size?
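For experimentation, the number of reducers (and hence output files) can be influenced with session settings rather than a hard-coded reducer count. This is a sketch using standard Hive properties; the exact values are assumptions to be tuned for your data:

```sql
-- Let Hive derive the reducer count from data volume:
-- roughly one reducer (and one output file) per 256 MB of input.
set hive.exec.reducers.bytes.per.reducer=268435456;

-- Or pin the reducer count explicitly (older property name shown;
-- newer Hadoop uses mapreduce.job.reduces):
set mapred.reduce.tasks=8;

insert overwrite table test partition (idate)
select * from test
where idate = '2015-04-25'
distribute by idate;
```

Note that `distribute by idate` sends all rows of one partition to a single reducer, which is why a single large file appears regardless of the reducer count; distributing by an additional column would spread the rows across reducers.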
NOTE: these are compressed RCFiles, so I can't simply concatenate them.
It is fine to have a large file in a splittable format, since any downstream job can split that file based on the block size. You typically get one output file per reducer; to get more reducers, you should define bucketing on your table. Adjust the number of buckets to get files of the size you want. For the bucketing column, choose a high-cardinality column that is a likely join candidate.
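The bucketing suggested above could look like the following sketch. The column names (`user_id`, `val`) and the bucket count are assumptions; `user_id` stands in for whatever high-cardinality join column the table actually has:

```sql
-- Hypothetical re-creation of the table with bucketing.
create table test_bucketed (
  user_id bigint,   -- assumed high-cardinality join column
  val     string    -- placeholder for the remaining columns
)
partitioned by (idate string)
clustered by (user_id) into 16 buckets
stored as rcfile;

-- Make Hive use one reducer per bucket when loading.
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test_bucketed partition (idate)
select user_id, val, idate from test;
```

With 16 buckets, each partition is written as up to 16 files, so the bucket count directly controls the file sizes within a partition.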