Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB of data in HDFS. I want to delete data older than 13 weeks in some Hive tables.

Are there any tools or ways I can achieve this?

Thanks!



1 answer


To delete data older than a certain time interval, you have several options.

First, if the Hive table is partitioned by date, you can simply DROP the old partitions in Hive. For a managed table this deletes the underlying HDFS directories as well; for an external table you have to remove those directories yourself.
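As a minimal sketch of that approach, assuming a hypothetical table events partitioned by a string column dt in yyyy-MM-dd format (both names are illustrative, and GNU date is assumed for the cutoff arithmetic):

# 13 weeks = 91 days; compute the cutoff date (GNU date syntax)
CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)

# drop every partition older than the cutoff; for a managed table
# Hive deletes the underlying HDFS directories as well
hive -e "ALTER TABLE events DROP IF EXISTS PARTITION (dt < '${CUTOFF}');"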

The second option is to INSERT the data you want to keep into a new table, filtering out old rows on a datestamp column (if one is available). This is probably not a good option since you have 100TB of data and it means rewriting the entire data set.
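If you did go this way, a sketch with the same hypothetical names (events_trimmed is an illustrative target table) might look like:

CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)   # GNU date assumed

# rewrite only the rows inside the retention window into a new table;
# this scans and rewrites the full data set, hence the cost at 100TB
hive -e "CREATE TABLE events_trimmed AS SELECT * FROM events WHERE dt >= '${CUTOFF}';"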



The third option is to recursively list the data directories for the Hive tables:

hadoop fs -ls -R /path/to/hive/table

(hadoop fs -lsr is the older, deprecated form of the same command.) This displays each file along with its modification date. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than the cutoff, remove it with hadoop fs -rm <file>.
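Strung together, that could look like the following sketch (the table path is a placeholder, and the awk field positions assume the usual ls output layout of permissions, replication, owner, group, size, date, time, path):

CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)   # GNU date assumed

# field 6 is the modification date (yyyy-MM-dd), field 8 the path;
# skip directories and header lines, keep only files older than the
# cutoff (assumes no spaces in file paths)
hadoop fs -ls -R /path/to/hive/table |
  awk -v cutoff="$CUTOFF" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $8}' |
  while read -r f; do
    hadoop fs -rm "$f"    # add -skipTrash to reclaim the space immediately
  done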

The fourth option is to grab a copy of the FSImage from the NameNode:

curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image

(On newer Hadoop versions, hdfs dfsadmin -fetchImage fetches the image the same way.) Next, convert it to a text file with the Offline Image Viewer:

hadoop oiv -i hdfs.image -o hdfs.txt

The text file will contain a plain-text representation of the filesystem, similar to the output of hadoop fs -ls, which you can scan offline without putting load on the NameNode.
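Assuming the oiv text output follows the same ls-style layout (this varies between Hadoop versions, so check a few lines of hdfs.txt first), the old files can be extracted with the same kind of filter:

CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)   # GNU date assumed

# collect the paths of files older than the cutoff into a work list,
# then feed that list to hadoop fs -rm in a second pass
awk -v cutoff="$CUTOFF" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $NF}' hdfs.txt > old_files.txt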
