How to optimize a query on a large dataset?

Question

How to optimize a query on a large dataset?

My original request is -

CREATE TABLE admin.FctPrfitAmt_rpt AS 
SELECT rcn.* FROM 
(SELECT t1.* FROM (SELECT * FROM admin.FctPrfitAmt t2 WHERE t2.scenario_id NOT IN(SELECT DISTINCT t3.scenario_id FROM admin.FctPrfitAmt_incr t3)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr) t1) rcn;

The problem is that this query is currently taking a long time, since the number of records is very large.

Is there a way to customize this query?

I have tried this approach but it does not work -

CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT * FROM admin.FctPrfitAmt t2 
WHERE t2.scenario_id NOT exists (SELECT 1 from  admin.FctPrfitAmt_incr t3 where t2.scenario_id = t3.scenario_id)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr

Error - It looks like "does not exist" is not supported in my version of Hive, so for my approach I got the following error:

Operator compilation error: FAILED: ParseException 3:25 line cannot recognize input near 'NOT' 'exists' '(' in expression specification

+3

performance hadoop hive bigdata

akash sharma Apr 21 17 at 16:22

source to share

3 answers

You are better off leaving the join of the two tables in the "select" section and filter out rows that are not null for the join key.

CREATE TABLE admin.FctPrfitAmt_rpt AS 
    SELECT rcn.* FROM 
    (
        SELECT t1.* 
        FROM admin.FctPrfitAmt t1 
        LEFT JOIN admin.FctPrfitAmt_incr t2
        ON t1.scenario_id = t2.scenario_id
        WHERE t2.scenario_id IS NULL
        UNION ALL
        SELECT * FROM admin.FctPrfitAmt_incr
    ) rcn
;

+2

Paul back Apr 21 17 at 16:39

source to share

select  *

from   (select  *
               ,max(tab) over (partition by scenario_id) as max_tab

        from    (           select *,1 as tab from master.FctPrfitAmt
                union all   select *,2 as tab from master.FctPrfitAmt_incr
                ) t
        ) t

where   tab     = 2
     or max_tab = 1
;

If all of your data consists of primitive types (no arrays, maps, etc.),
you can use the following query:

select  inline(array(original_rowset))

from   (select  original_rowset
               ,tab
               ,max(tab) over (partition by scenario_id) as max_tab

        from    (           select struct(*) as original_rowset,scenario_id,1 as tab from FctPrfitAmt
                union all   select struct(*) as original_rowset,scenario_id,2 as tab from FctPrfitAmt_incr
                ) t
        ) t

where   tab     = 2
     or max_tab = 1

0

David דודו Markovitz Apr 24 17 at 8:51

source to share

David דודו Markovitz · Accepted Answer · 2017-04-22T06:48:21+0000

Your syntax is incorrect. NOT EXISTS

should not precedet2.scenario_id
As we can see, script_id is malformed on both tables, which creates a huge product when combined.

select  * 
from    admin.FctPrfitAmt   pa
where   not exists 
        (
            select  null

            from   (select  distinct 
                            pfa.scenario_id 

                    from    admin.FctPrfitAmt_incr  pfa
                    ) pfa

            where   pfa.scenario_id = 
                    pa.scenario_id
        )

union all

select  * 
from    admin.FctPrfitAmt_incr

How to optimize a query on a large dataset?

More articles: