Hive (find min of n columns in a row)
I have a table in Hive with 5 columns, i.e. email, a_first_date, b_first_date, c_first_date, d_first_date.
a, b, c, d - 4 different actions the user can take and the 4 columns in the above table indicate the date the user performed the first corresponding action. For example, the value in 'a_first_date' has the date on which the user performed the action.
Output: what I want is 2 columns with email, total_first_date i.e. on what day did the user take their first action?
Example table: (Assume all values ββare of type BIGINT except email)
email, a_first_date, b_first_date, c_first_date, d_first_date
ABC, 20140707,20140702,20140801,20140907
xug, 20140107,20140822,20140201,20141007
Output:
email, overall_first_date
ABC, 20140702
xug, 20140107
A possible pair of solutions is to write a UDF or compare these values ββto each other using IF ELSE and then find the minimum, but this will involve a lot of comparisons.
Alternatively, I can do:
select email, min(action) as overall_first_date from
(
select email, a_first_date as action from mytable
UNION ALL
select email, b_first_date as action from mytable
UNION ALL
select email, c_first_date as action from mytable
UNION ALL
select email, d_first_date as action from mytable
) q1
GROUP BY email
but again this is not very good.
Can anyone suggest a better way to achieve this?
source to share
Why not use a case statement? This seems to have been discussed in detail in SQL here: https://dba.stackexchange.com/questions/21542/what-is-the-most-efficient-way-to-get-the-minimum-of-multiple-columns -on-sql-ser
source to share
You can use Hive array functions:
select email,
sort_array(array(a_first_date, b_first_date, c_first_date, d_first_date))[0] as overall_first_date
from table;
I'm not sure how this compares to the CASE statement for performance. Since you don't have a lot of columns, they are equally simple.
source to share