Get files generated in the last 5 minutes in Hadoop using a shell script

I have files in HDFS like:

drwxrwx---   - root supergroup          0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639064
drwxrwx---   - root supergroup          0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639065

      

The directory /tmp/logs/root/logs/ continually receives new files. I want to get the files created in the last five minutes, relative to the current time, and then copy them to my local machine.



2 answers


I did it with the command below, which lists the files created within a given five-minute window:

hadoop fs -ls /tmp/logs/root/logs | awk '$6 == "2016-08-18" && $7 >= "20:55" && $7 <= "21:00" { print $8 }'

      



The hard-coded date and times can be replaced with values derived from the current timestamp.
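For what it's worth, here is a rough sketch of how the window could be derived from the current time. It assumes GNU date (for the -d option), and filter_window is just a helper name made up for this example, not anything provided by Hadoop. Comparing the date and time together as one string should also keep the filter working when the window crosses midnight. Keep in mind that -ls reports the modification time, which is being used here as a stand-in for creation time.

```shell
# Read `hadoop fs -ls` output on stdin and print the paths whose timestamp
# lies in [start, end]. Both bounds are "YYYY-MM-DD HH:MM" strings, which
# sort correctly as plain strings. (filter_window is a hypothetical helper.)
filter_window() {
    awk -v start="$1" -v end="$2" '
        { ts = $6 " " $7 }                      # date and time columns of -ls
        ts >= start && ts <= end { print $8 }   # column 8 is the path
    '
}

# Window bounds from the current time; assumes GNU date for -d.
end=$(date '+%Y-%m-%d %H:%M')
start=$(date -d '5 minutes ago' '+%Y-%m-%d %H:%M')

# Usage (needs a Hadoop client on PATH):
# hadoop fs -ls /tmp/logs/root/logs | filter_window "$start" "$end"
```

The "Found N items" header line that -ls prints has no date columns, so it never matches the window and is skipped without extra filtering.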



How about this:

hdfs dfs -ls /tmp | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=5; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF < LAST){ print $3 }}'

      

Explanation:

List all files:

hdfs dfs -ls /tmp

Squeeze repeated spaces into one:

tr -s " "

Get required columns:

cut -d' ' -f6-8

Remove unnecessary lines:

grep "^[0-9]"

Processing with awk:




Initialize the window length (in seconds) and the current time:

MIN=5; LAST=60*MIN; "date +%s" | getline NOW

Create a command to get the epoch value for a file timestamp on HDFS:

cmd="date -d'\''"$1" "$2"'\'' +%s";

Run the command to get the epoch value for the HDFS file:

cmd | getline WHEN;

Get the time difference:

DIFF=NOW-WHEN;

Print the output depending on the difference:

if(DIFF < LAST){ print $3 }

You just need to change the value of MIN depending on your requirement (here it's 5 minutes). HTH
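In case a script is easier to maintain than the one-liner, the same epoch comparison could be sketched as a shell function, followed by the copy step the question asks about. This is only a sketch under some assumptions: GNU date is needed for -d, the names recent_paths and copy_recent are made up for this example, and the local destination directory is whatever you pass in. The copy uses hdfs dfs -get.

```shell
# Read `hdfs dfs -ls` output on stdin and print paths modified less than
# $1 minutes before the epoch time $2 (defaults to now).
# recent_paths is a hypothetical helper name; assumes GNU date for -d.
recent_paths() {
    minutes=$1
    now=${2:-$(date +%s)}
    while read -r perms repl owner group size day time path; do
        case $day in
            [0-9][0-9][0-9][0-9]-*) ;;   # real listing row, keep going
            *) continue ;;               # e.g. the "Found N items" header
        esac
        when=$(date -d "$day $time" +%s) || continue
        if [ $((now - when)) -lt $((minutes * 60)) ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Copy everything modified in the last $2 minutes under HDFS directory $1
# into local directory $3. copy_recent is also a hypothetical name.
copy_recent() {
    hdfs dfs -ls "$1" | recent_paths "$2" |
        while read -r p; do
            hdfs dfs -get "$p" "$3"
        done
}

# Usage (needs a Hadoop client on PATH; the local directory is an example):
# copy_recent /tmp/logs/root/logs 5 ./recent_logs
```

Running date once per listing row is slow on very large directories, so for those the string comparison in a single awk pass may be the better fit.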
