Hive output file is greater than the dfs block size limit
I have a table test
that was created in Hive. It is partitioned by idate
, and I often need to insert data into individual partitions. This can leave files on HDFS that are only a few lines long.
hadoop fs -ls /db/test/idate=1989-04-01
Found 3 items
-rwxrwxrwx 3 deployer supergroup 710 2015-04-26 11:33 /db/test/idate=1989-04-01/000000_0
-rwxrwxrwx 3 deployer supergroup 710 2015-04-26 11:33 /db/test/idate=1989-04-01/000001_0
-rwxrwxrwx 3 deployer supergroup 710 2015-04-26 11:33 /db/test/idate=1989-04-01/000002_0
I'm trying to put together a simple script to combine these files to avoid a lot of small files on my partitions:
insert overwrite table test partition (idate)
select * from test
where idate = '1989-04-01'
distribute by idate
This works: it creates a new file containing all the rows from the old ones. The problem is that when I run this script on large partitions, the output is still a single file:
hadoop fs -ls /db/test/idate=2015-04-25
Found 1 items
-rwxrwxrwx 3 deployer supergroup 1400739967 2015-04-27 10:53 /db/test/idate=2015-04-25/000001_0
This file is larger than 1 GB, but the block size is set to 128 MB:
hive> set dfs.blocksize;
dfs.blocksize=134217728
I could manually set the number of reducers to keep the output files small, but shouldn't they be split automatically? Why does Hive create files larger than the configured block size?
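For experimentation, the number of reducers (and hence output files) can be influenced with session settings rather than a hard-coded reducer count. This is a sketch using standard Hive properties; the exact values are assumptions to be tuned for your data:

```sql
-- Let Hive derive the reducer count from data volume:
-- roughly one reducer (and one output file) per 256 MB of input.
set hive.exec.reducers.bytes.per.reducer=268435456;

-- Or pin the reducer count explicitly (older property name shown;
-- newer Hadoop uses mapreduce.job.reduces):
set mapred.reduce.tasks=8;

insert overwrite table test partition (idate)
select * from test
where idate = '2015-04-25'
distribute by idate;
```

Note that `distribute by idate` sends all rows of one partition to a single reducer, which is why a single large file appears regardless of the reducer count; distributing by an additional column would spread the rows across reducers.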
NOTE: these are compressed RCFiles, so I can't simply concatenate them.
It is fine to have a large file in a splittable format, since any downstream job can split that file based on the block size. You typically get one output file per reducer; to get more reducers, you should define bucketing on your table. Adjust the number of buckets to get files of the size you want. For the bucketing column, choose a high-cardinality column that is a likely join candidate.
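The bucketing suggested above could look like the following sketch. The column names (`user_id`, `val`) and the bucket count are assumptions; `user_id` stands in for whatever high-cardinality join column the table actually has:

```sql
-- Hypothetical re-creation of the table with bucketing.
create table test_bucketed (
  user_id bigint,   -- assumed high-cardinality join column
  val     string    -- placeholder for the remaining columns
)
partitioned by (idate string)
clustered by (user_id) into 16 buckets
stored as rcfile;

-- Make Hive use one reducer per bucket when loading.
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test_bucketed partition (idate)
select user_id, val, idate from test;
```

With 16 buckets, each partition is written as up to 16 files, so the bucket count directly controls the file sizes within a partition.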