Get files generated in the last 5 minutes in Hadoop using a shell script
I have files in HDFS like:
drwxrwx--- - root supergroup 0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639064
drwxrwx--- - root supergroup 0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639065
New files arrive continually in the directory /tmp/logs/root/logs/. I want to get the files created in the last five minutes, relative to the current time, and then copy them to my local machine.
I did it using the command below, which lists the files created within a five-minute window:
hadoop fs -ls /tmp/logs/root/logs | awk '$6 == "2016-08-18" && $7 >= "20:55" && $7 <= "21:00" { print $8 }'
The date and time bounds can be adjusted to match the current timestamp.
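As a rough sketch of how the hard-coded bounds could be derived from the current time: the function name filter_window is made up for illustration, and the usage lines assume GNU date (for the -d '5 minutes ago' form). Comparing the date and time as one string also covers a window that crosses midnight.

```shell
#!/usr/bin/env bash
# filter_window START END   (hypothetical helper, not part of Hadoop)
# Reads `hadoop fs -ls` output on stdin and prints the paths whose timestamp
# lies inside [START, END]. Both bounds use the "YYYY-MM-DD HH:MM" format,
# so a plain string comparison orders them chronologically.
filter_window() {
  awk -v start="$1" -v end="$2" '
    NF >= 8 {                      # skips the "Found N items" header line
      stamp = $6 " " $7            # date and time columns of the listing
      if (stamp >= start && stamp <= end) print $8
    }'
}

# Usage (assumes GNU date):
#   end=$(date '+%Y-%m-%d %H:%M')
#   start=$(date -d '5 minutes ago' '+%Y-%m-%d %H:%M')
#   hadoop fs -ls /tmp/logs/root/logs | filter_window "$start" "$end"
```

Like the original command, this prints only field 8, so a path containing spaces would be cut short.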
How about this:
hdfs dfs -ls /tmp | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=5; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF < LAST){ print $3 }}'
Explanation:
List all files:
hdfs dfs -ls /tmp
Replace extra spaces:
tr -s " "
Get required columns:
cut -d' ' -f6-8
Remove unnecessary lines:
grep "^[0-9]"
Processing with awk:
awk
Initialize DIFF duration and current time:
MIN=5; LAST=60*MIN; "date +%s" | getline NOW
Create a command to get the epoch value for a file timestamp on HDFS:
cmd="date -d'\''"$1" "$2"'\'' +%s";
Run the command to get the epoch value for the HDFS file:
cmd | getline WHEN;
Get the time difference:
DIFF=NOW-WHEN;
Print the output depending on the difference:
if(DIFF < LAST){ print $3 }
You just need to change the value of the MIN variable to suit your requirement (here it is 5 minutes). HTH
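The same steps, plus the copy to the local machine that the question asks for, could be sketched as a small shell function. The name recent_paths and the local directory ./recent_logs are made up for illustration, and GNU date is assumed for parsing the timestamp. As far as I know, the time shown by the listing is the modification time, printed in the client's local time zone.

```shell
#!/usr/bin/env bash
# recent_paths NOW_EPOCH MINUTES   (hypothetical helper, not part of Hadoop)
# Reads `hdfs dfs -ls` output on stdin and prints every path whose listed
# timestamp is less than MINUTES minutes older than NOW_EPOCH.
recent_paths() {
  local now=$1 limit=$(( $2 * 60 ))
  local perms repl owner group size day time path when
  # The last variable (path) receives the rest of the line, spaces included.
  while read -r perms repl owner group size day time path; do
    # Skip the "Found N items" header and any line without a date column.
    [[ $day =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || continue
    # Convert the listed timestamp to epoch seconds (GNU date assumed).
    when=$(date -d "$day $time" +%s) || continue
    if (( now - when < limit )); then
      printf '%s\n' "$path"
    fi
  done
}

# Usage: copy everything from the last 5 minutes into ./recent_logs
#   mkdir -p ./recent_logs
#   hdfs dfs -ls /tmp/logs/root/logs | recent_paths "$(date +%s)" 5 |
#     while IFS= read -r p; do hdfs dfs -get "$p" ./recent_logs/; done
```

Passing the current time in as an argument keeps the filter easy to check against a saved listing. The listing only has minute resolution, so the window is approximate at its edges.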