How to optimize a query on a large dataset?
My original request is -
CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT rcn.* FROM
(SELECT t1.* FROM (SELECT * FROM admin.FctPrfitAmt t2 WHERE t2.scenario_id NOT IN(SELECT DISTINCT t3.scenario_id FROM admin.FctPrfitAmt_incr t3)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr) t1) rcn;
The problem is that this query is currently taking a long time, since the number of records is very large.
Is there a way to customize this query?
I have tried this approach but it does not work -
CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT * FROM admin.FctPrfitAmt t2
WHERE t2.scenario_id NOT exists (SELECT 1 from admin.FctPrfitAmt_incr t3 where t2.scenario_id = t3.scenario_id)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr
Error - It looks like "does not exist" is not supported in my version of Hive, so for my approach I got the following error:
Operator compilation error: FAILED: ParseException 3:25 line cannot recognize input near 'NOT' 'exists' '(' in expression specification
+3
source to share
3 answers
- Your syntax is incorrect.
NOT EXISTS
should not precedet2.scenario_id
- As we can see, script_id is malformed on both tables, which creates a huge product when combined.
select *
from admin.FctPrfitAmt pa
where not exists
(
select null
from (select distinct
pfa.scenario_id
from admin.FctPrfitAmt_incr pfa
) pfa
where pfa.scenario_id =
pa.scenario_id
)
union all
select *
from admin.FctPrfitAmt_incr
+2
source to share
You are better off leaving the join of the two tables in the "select" section and filter out rows that are not null for the join key.
CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT rcn.* FROM
(
SELECT t1.*
FROM admin.FctPrfitAmt t1
LEFT JOIN admin.FctPrfitAmt_incr t2
ON t1.scenario_id = t2.scenario_id
WHERE t2.scenario_id IS NULL
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr
) rcn
;
+2
source to share
select *
from (select *
,max(tab) over (partition by scenario_id) as max_tab
from ( select *,1 as tab from master.FctPrfitAmt
union all select *,2 as tab from master.FctPrfitAmt_incr
) t
) t
where tab = 2
or max_tab = 1
;
If all of your data consists of primitive types (no arrays, maps, etc.),
you can use the following query:
select inline(array(original_rowset))
from (select original_rowset
,tab
,max(tab) over (partition by scenario_id) as max_tab
from ( select struct(*) as original_rowset,scenario_id,1 as tab from FctPrfitAmt
union all select struct(*) as original_rowset,scenario_id,2 as tab from FctPrfitAmt_incr
) t
) t
where tab = 2
or max_tab = 1
0
source to share