Apache Hive: how to combine several queries on one table, each with a different condition?
I have a Hive table named "sales" with the structure below:
    id,ptype,amount,time,date
    1,a,12,2240,2013-12-25
    1,a,4,1830,2013-12-25
    1,b,2,1920,2013-12-25
    1,b,3,2023,2013-12-25
    2,a,5,1220,2013-12-25
    2,a,1,1320,2013-12-25
Below are my queries for the individual variables:
    Q1: select id, sum(amount) as s_amt from sales group by id;
    Q2: select id, sum(amount) as s_a_amt from sales where ptype='a' group by id;
    Q3: select id, sum(amount) as s_b_amt from sales where ptype='b' group by id;
As far as I can tell, "union all" in Hive can only be applied when the queries have the same column names (the same schema). Below is the end result I want to achieve with a Hive query:
    id,s_amt,s_a_amt,s_b_amt
    1,21,16,5
    2,6,6,0
Below is one query that I tried, and it ran successfully. But writing the same kind of query for more than 300 variables would be very painful. Is there a more efficient approach for this task, given that we have over 300 variables? I'd appreciate your comments!
    select t.id,
           max(t.s_amt) as s_amt,
           max(t.s_a_amt) as s_a_amt,
           max(t.s_b_amt) as s_b_amt
    from (
        select s1.id, sum(amount) as s_amt, 0 as s_a_amt, 0 as s_b_amt
        from sales s1
        group by id
        union all
        select s2.id, 0 as s_amt, sum(amount) as s_a_amt, 0 as s_b_amt
        from sales s2
        where ptype='a'
        group by id
        union all
        select s3.id, 0 as s_amt, 0 as s_a_amt, sum(amount) as s_b_amt
        from sales s3
        where ptype='b'
        group by id
    ) t
    group by t.id;
The ideal solution is a Materialized Query Table (MQT), as IBM calls it.
Summary tables are a special form of MQT, and that is exactly what you need. Quick definition: as the name suggests, an MQT is a simple summary table materialized on disk.
With MQT support, all you have to do is:
    -- DB2 syntax: an MQT is created with CREATE TABLE ... AS (fullselect)
    CREATE TABLE MQTA AS (
        select id, sum(amount) as s_a_amt
        from sales
        where ptype='a'
        group by id
    )
    DATA INITIALLY DEFERRED
    REFRESH DEFERRED
    MAINTAINED BY USER;
DATA INITIALLY DEFERRED says that no summary records are inserted into the summary table when it is created. REFRESH DEFERRED says that the data in the table can be refreshed at any time using the REFRESH TABLE statement. MAINTAINED BY USER says that refreshing the table is the user's responsibility. MAINTAINED BY SYSTEM is the other option, where the system takes care of updating the summary table automatically when the underlying table sees inserts, deletes, or updates.
You can query the MQT directly with a simple select. All the heavy lifting of summarizing the records has been done beforehand, not at query time, so it is much faster.
But AFAIK Hive doesn't support MQTs or summary tables.
Now that you know the concept, you just have to simulate it. Create a summary table and insert summary records into it (the REFRESH TABLE concept). Load the summary values periodically, keeping track of a last-load date field so that you only pick up records added since the last refresh. You can do this with scheduled jobs running Hive scripts.
    -- last_Refresh_date is a placeholder for the time of the previous load
    INSERT INTO TABLE PTYPE_AMOUNT_MQT
    select * from (
        select s1.id, sum(amount) as s_amt, 0 as s_a_amt, 0 as s_b_amt
        from sales s1
        where record_create_date > last_Refresh_date
        group by id
        union all
        select s2.id, 0 as s_amt, sum(amount) as s_a_amt, 0 as s_b_amt
        from sales s2
        where ptype='a' and record_create_date > last_Refresh_date
        group by id
        union all
        select s3.id, 0 as s_amt, 0 as s_a_amt, sum(amount) as s_b_amt
        from sales s3
        where ptype='b' and record_create_date > last_Refresh_date
        group by id
    ) t;
It's always a good idea to have audit fields such as the record creation date and time (record_create_date here). last_Refresh_date is the last time your job ran.
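To make the simulated summary table a bit more concrete, here is a minimal sketch of what PTYPE_AMOUNT_MQT could look like and how it might be read back. The column types are assumptions (they are not given in the question), and the table is taken to be a plain Hive table that the scheduled job appends to.

```sql
-- Sketch only: the column types below are assumptions, not from the original post.
CREATE TABLE IF NOT EXISTS PTYPE_AMOUNT_MQT (
    id      INT,
    s_amt   BIGINT,
    s_a_amt BIGINT,
    s_b_amt BIGINT
);

-- Every refresh appends partial sums, with 0 in the columns a branch does not fill,
-- so the rows belonging to one id are added together at read time.
SELECT id,
       SUM(s_amt)   AS s_amt,
       SUM(s_a_amt) AS s_a_amt,
       SUM(s_b_amt) AS s_b_amt
FROM PTYPE_AMOUNT_MQT
GROUP BY id;
```

SUM is used here rather than MAX because, once several incremental loads have run, each id has one set of partial sums per load, and those need to be added up rather than compared.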
Hive recently added GROUPING SETS as a new feature ( https://issues.apache.org/jira/browse/HIVE-3471 ). It can be much easier to write and to read than an MQT. But not everyone knows about this feature, and CASE expressions, as illustrated by Arno, are more commonly used in practice.
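As a rough illustration of the two options mentioned here, the sketch below runs both against the "sales" table from the question. It assumes a Hive version that includes GROUPING SETS and that ptype is never NULL in the data, so a NULL ptype in the second result can be read as the per-id total.

```sql
-- Option 1: CASE expressions (conditional aggregation), one pass over the table.
-- Produces the wide layout asked for: id, s_amt, s_a_amt, s_b_amt.
SELECT id,
       SUM(amount) AS s_amt,
       SUM(CASE WHEN ptype = 'a' THEN amount ELSE 0 END) AS s_a_amt,
       SUM(CASE WHEN ptype = 'b' THEN amount ELSE 0 END) AS s_b_amt
FROM sales
GROUP BY id;

-- Option 2: GROUPING SETS, one row per id (ptype is NULL, the total)
-- plus one row per (id, ptype) combination that occurs in the data.
SELECT id, ptype, SUM(amount) AS amt
FROM sales
GROUP BY id, ptype
GROUPING SETS ((id), (id, ptype));
```

With the CASE form, each extra variable is one more SUM(CASE ...) line, so the query grows with the number of variables but still scans the table once. The GROUPING SETS form stays the same size no matter how many ptype values exist, but it returns the result in a long layout (one row per id and ptype) rather than one column per variable, and combinations with no rows, such as id 2 with ptype 'b', simply do not appear.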