Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB of data in HDFS. I want to delete data older than 13 weeks in some Hive tables.

Are there any tools or ways I can achieve this?

Thanks!



1 answer


To delete data older than a certain time interval, you have several options.

First, if the Hive table is partitioned by date, you can simply DROP the old partitions in Hive. For a managed table this deletes the underlying HDFS directories as well; for an external table you have to remove those directories yourself.
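As a minimal sketch of that approach, assuming a hypothetical table events partitioned by a string column dt in yyyy-MM-dd format (both names are illustrative, and GNU date is assumed for the cutoff arithmetic):

# 13 weeks = 91 days; compute the cutoff date (GNU date syntax)
CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)

# drop every partition older than the cutoff; for a managed table
# Hive deletes the underlying HDFS directories as well
hive -e "ALTER TABLE events DROP IF EXISTS PARTITION (dt < '${CUTOFF}');"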

The second option is to INSERT the data you want to keep into a new table, filtering out old rows on a datestamp column (if one is available). This is probably not a good option since you have 100TB of data and it means rewriting the entire data set.
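If you did go this way, a sketch with the same hypothetical names (events_trimmed is an illustrative target table) might look like:

CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)   # GNU date assumed

# rewrite only the rows inside the retention window into a new table;
# this scans and rewrites the full data set, hence the cost at 100TB
hive -e "CREATE TABLE events_trimmed AS SELECT * FROM events WHERE dt >= '${CUTOFF}';"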



The third option is to recursively list the data directories for the Hive tables:

hadoop fs -ls -R /path/to/hive/table

(hadoop fs -lsr is the older, deprecated form of the same command.) This displays each file along with its modification date. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than the cutoff, remove it with hadoop fs -rm <file>.
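Strung together, that could look like the following sketch (the table path is a placeholder, and the awk field positions assume the usual ls output layout of permissions, replication, owner, group, size, date, time, path):

CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)   # GNU date assumed

# field 6 is the modification date (yyyy-MM-dd), field 8 the path;
# skip directories and header lines, keep only files older than the
# cutoff (assumes no spaces in file paths)
hadoop fs -ls -R /path/to/hive/table |
  awk -v cutoff="$CUTOFF" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $8}' |
  while read -r f; do
    hadoop fs -rm "$f"    # add -skipTrash to reclaim the space immediately
  done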

The fourth option is to grab a copy of the FSImage from the NameNode:

curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image

(On newer Hadoop versions, hdfs dfsadmin -fetchImage fetches the image the same way.) Next, convert it to a text file with the Offline Image Viewer:

hadoop oiv -i hdfs.image -o hdfs.txt

The text file will contain a plain-text representation of the filesystem, similar to the output of hadoop fs -ls, which you can scan offline without putting load on the NameNode.
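Assuming the oiv text output follows the same ls-style layout (this varies between Hadoop versions, so check a few lines of hdfs.txt first), the old files can be extracted with the same kind of filter:

CUTOFF=$(date -d '91 days ago' +%Y-%m-%d)   # GNU date assumed

# collect the paths of files older than the cutoff into a work list,
# then feed that list to hadoop fs -rm in a second pass
awk -v cutoff="$CUTOFF" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $NF}' hdfs.txt > old_files.txt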
