Data retention in Hadoop HDFS
To delete data older than a certain time interval, you have several options.
First, if the Hive table is partitioned by date, you can simply DROP the old partitions in Hive; for a managed table this also removes their underlying directories, while for an external table you have to delete the directories yourself with hadoop fs -rm -r.
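As a minimal sketch, assuming a managed table named logs partitioned by a string column dt in yyyy-MM-dd format (both names are illustrative), Hive accepts a comparator in the partition spec:

# Drop every partition older than the cutoff date; for a managed table
# Hive also deletes the partition directories. "logs" and "dt" are
# assumed names -- substitute your own table and partition column.
hive -e "ALTER TABLE logs DROP IF EXISTS PARTITION (dt < '2015-01-01');"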
The second option is to INSERT the data you want to keep into a new table, filtering out old rows by datestamp (if such a column is available). This is probably not a good option, since it rewrites the dataset and you have 100TB of data.
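For completeness, a sketch of that approach, assuming an unpartitioned table logs with a string column datestamp (all names are illustrative):

# Copy only the rows inside the retention window into a fresh table.
# This rewrites the entire dataset, which is why it does not scale to 100TB.
hive -e "
CREATE TABLE logs_new LIKE logs;
INSERT OVERWRITE TABLE logs_new
SELECT * FROM logs WHERE datestamp >= '2015-01-01';
"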
The third option is to recursively list the directories of the Hive tables:
hadoop fs -lsr /path/to/hive/table
(-lsr is deprecated in newer Hadoop versions in favor of hadoop fs -ls -R). This will display a list of files and their modification dates, which for write-once HDFS files are effectively creation dates. Take this output, extract the date, and compare it against the time frame you want to keep; if the file is older, remove it with hadoop fs -rm <file>.
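Putting that together, a minimal bash sketch (the path and cutoff date are placeholders; dry-run by replacing the rm with echo before deleting anything):

#!/usr/bin/env bash
# Remove HDFS files whose modification date is older than the cutoff.
CUTOFF="2015-01-01"
hadoop fs -ls -R /path/to/hive/table | while read -r perms repl owner group size date time path; do
  # Keep only plain file entries; their permission string starts with "-"
  # (this also skips directories and any "Found N items" header line).
  [[ $perms == -* ]] || continue
  # yyyy-MM-dd dates compare correctly as plain strings.
  if [[ $date < $CUTOFF ]]; then
    hadoop fs -rm "$path"
  fi
done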
The fourth option is to grab a copy of the FSImage from the active NameNode: curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image
Next, convert it to a text file with the Offline Image Viewer: hadoop oiv -i hdfs.image -o hdfs.txt
The text file will contain a text representation of the HDFS namespace, in the same format as the output of hadoop fs -ls ...
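A sketch of processing that dump offline, which spares the NameNode the recursive listing of a 100TB namespace. The field positions assume the ls-style layout described above and paths without spaces; adjust them if your Hadoop version's oiv processor emits a different format:

# Emit a delete command for every file older than the cutoff into a
# script that can be reviewed before it is executed.
awk -v cutoff="2015-01-01" \
    '$1 ~ /^-/ && $6 < cutoff { print "hadoop fs -rm " $8 }' hdfs.txt > delete_old.sh

Review delete_old.sh, then run it with bash delete_old.sh.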