Detecting a compression codec in Hadoop from the command line

Is there an easy way to find out the codec used to compress a file in Hadoop?

Do I need to write a Java program, or load the file into Hive so I can use describe formatted table?

2 answers


If you are asking which codec MapReduce uses for intermediate map output and/or the final job output, you can check the Hadoop config file, usually located at <HADOOP_HOME>/etc/hadoop/mapred-site.xml. However, I don't know of a way to check this directly from the command line.

The settings for compressing intermediate map output should look something like this:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>

<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>



The job output compression settings should look something like this:

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

From these two snippets you can see that I am using the Gzip codec and that I am compressing both the intermediate map output and the final output. Hope it helps!
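If you'd rather not eyeball the XML by hand, here is a small sketch of pulling those keys out of mapred-site.xml. The helper name read_hadoop_conf is mine, and it assumes the standard <configuration>/<property> layout shown above:

```python
import xml.etree.ElementTree as ET

def read_hadoop_conf(path, keys):
    """Return {name: value} for the requested property names
    in a Hadoop *-site.xml file."""
    root = ET.parse(path).getroot()
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in keys:
            found[name] = prop.findtext("value")
    return found

# Example (path is illustrative):
# read_hadoop_conf("/opt/hadoop/etc/hadoop/mapred-site.xml",
#                  {"mapreduce.map.output.compress",
#                   "mapreduce.map.output.compress.codec"})
```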



One way to do this is to download the file locally (using hdfs dfs -get) and then follow the usual procedure for determining the compression format of a local file.

This should work well for files compressed outside of Hadoop. For files created in Hadoop, it will only work in a limited number of cases, e.g. text files compressed with gzip.
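For that local-file check, one simple approach is to look at the file's magic bytes. This sketch (the function name detect_codec is mine) recognizes a few common whole-file formats; it will not identify codecs hidden inside container formats:

```python
# Magic bytes (file signatures) of common compression formats.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\x28\xb5\x2f\xfd": "zstd",
    b"\x04\x22\x4d\x18": "lz4 (frame)",
    b"\xfd7zXZ\x00": "xz",
}

def detect_codec(path):
    """Return a best-guess codec name for a local file, or None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None
```

This is essentially what the Unix file command does, so running file on the downloaded copy gets you the same answer.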

Files compressed in Hadoop are more likely to be so-called "container formats", e.g. Avro, SequenceFile, Parquet, etc. This means that the file as a whole is not compressed; only chunks of data within it are. The Hive command describe formatted table that you mention will indeed help you determine the input format of the underlying files.

Once you know the file format, refer to that format's documentation or source code to find out how to detect the codec. Some file formats even come with command-line tools for viewing file metadata, which includes the compression codec. Some examples:



Avro:

hadoop jar /path/to/avro-tools.jar getmeta FILE_LOCATION_ON_HDFS --key 'avro.codec'

Parquet:

hadoop jar /path/to/parquet-tools.jar meta FILE_LOCATION_ON_HDFS







