MapReduce in Java - gzip input files

I am using Java, and I am trying to write a MapReduce job that will receive as input a folder containing multiple .gz files.

I have searched everywhere, but all the tutorials I found deal with processing a plain text file; I found nothing that addresses my problem.

I asked at my workplace, but only got Scala references, which I'm not familiar with.

Any help would be appreciated.

1 answer


Hadoop checks the file extension to detect compressed files and pick the right codec. The compression formats supported by Hadoop include gzip, bzip2, and LZO. You don't need to take any extra steps to decompress files in these formats; Hadoop handles it for you.

So all you have to do is write the logic as you would for a plain text file and pass the directory containing the .gz files as the input path.
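For example, a standard word-count job will run unchanged over a folder of gzipped text files. Below is a minimal sketch, assuming the Hadoop 2.x MapReduce API; the class names (GzipWordCount and so on) are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipWordCount {

    // The mapper sees plain decompressed lines; Hadoop picks the GzipCodec
    // from the .gz extension before the record reader runs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip word count");
        job.setJarByClass(GzipWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Point the input at the directory holding the .gz files;
        // no extra decompression step is needed.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run it with something like hadoop jar wordcount.jar GzipWordCount /input/gz-folder /output (paths are illustrative), and Hadoop decompresses each .gz file transparently before the mapper sees a single line.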



But one caveat with gzip files is that they are not splittable. Imagine you have a gzip file of, say, 5 GB: a single mapper will then have to process the entire 5 GB file, instead of the work being divided into chunks of the default block size.
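Whether a file can be split is decided by its codec. Here is a small sketch, again assuming the Hadoop 2.x API (the class name SplittabilityCheck is made up), that mirrors the check TextInputFormat performs internally: GzipCodec does not implement SplittableCompressionCodec, while BZip2Codec does.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Look up the codec by file extension, just as TextInputFormat does.
        CompressionCodec codec = factory.getCodec(new Path(args[0]));
        if (codec == null) {
            // No codec matched: the file is treated as uncompressed and splittable.
            System.out.println("Uncompressed: splittable");
        } else {
            // A .gz file maps to GzipCodec, which is not a
            // SplittableCompressionCodec, so it goes to a single mapper.
            System.out.println(codec.getClass().getSimpleName() + " splittable: "
                    + (codec instanceof SplittableCompressionCodec));
        }
    }
}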
