Hive (find min of n columns in a row)

I have a table in Hive with 5 columns, i.e. email, a_first_date, b_first_date, c_first_date, d_first_date.

a, b, c, d - 4 different actions the user can take and the 4 columns in the above table indicate the date the user performed the first corresponding action. For example, the value in 'a_first_date' has the date on which the user performed the action.

Output: what I want is 2 columns with email, total_first_date i.e. on what day did the user take their first action?

Example table: (Assume all values ​​are of type BIGINT except email)

email, a_first_date, b_first_date, c_first_date, d_first_date
ABC, 20140707,20140702,20140801,20140907
xug, 20140107,20140822,20140201,20141007

Output:

email, overall_first_date
ABC, 20140702
xug, 20140107

A possible pair of solutions is to write a UDF or compare these values ​​to each other using IF ELSE and then find the minimum, but this will involve a lot of comparisons.

Alternatively, I can do:

select email, min(action) as overall_first_date from

(
select email, a_first_date as action from mytable
UNION ALL
select email, b_first_date as action from mytable
UNION ALL
select email, c_first_date as action from mytable
UNION ALL
select email, d_first_date as action from mytable
) q1

GROUP BY email 

      

but again this is not very good.

Can anyone suggest a better way to achieve this?

+3


source to share


3 answers


Why not use a case statement? This seems to have been discussed in detail in SQL here: https://dba.stackexchange.com/questions/21542/what-is-the-most-efficient-way-to-get-the-minimum-of-multiple-columns -on-sql-ser



+2


source


You can use Hive array functions:

select email, 
       sort_array(array(a_first_date, b_first_date, c_first_date, d_first_date))[0]  as overall_first_date
from table;

      



I'm not sure how this compares to the CASE statement for performance. Since you don't have a lot of columns, they are equally simple.

+5


source


Use the minimum () function. For example; Select *, smallest (col1, col2, col3) as minofcol from Tablename;

+4


source







All Articles