Filtering a large table by date

Question

Filtering a large table by date

I have a VISIT_INFO table with these columns:

pers_key - unique identifyer for each person
pers_name - name of person
visit_date - date at which they visited a business

And another table, VALID_DATES, with these columns:

condition - string
start_date - date
end_date - date

I currently have the following request:

select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key

So condition 1 has defined start_date and end_date. I need to filter VISIT_INFO for visits that are between two dates. I am wondering if there is a more efficient way to do this. From my current understanding, it should currently go through the whole table (millions of rows) and add start_date and end_date to each row. Then do I have to iterate over each line and check the WHERE clause?

I ask this because when I remove the cross join and hardcode start_date and end_date for condition 1, it takes significantly less time. I try to avoid hard-coding dates because it will lead to serious boredom in the future.

So, to reiterate, is there a better way to filter VISIT_INFO for specific dates in VALID_DATES?

Edit: I just realized that I left quite a lot of information as this is all in HIVE. Therefore EXISTS and joins to (a between b and c) is out of the question.

+1

performance date sql join hive

Danzo 21 jul. 15 at 15:47

source to share

3 answers

Tripp kinetics · Answer 1 · 2015-07-21T15:54:34+0000

What about:

SELECT DISTINCT pers_key, pers_name
FROM visit_info
WHERE EXISTS
(
    SELECT 1
    FROM valid_dates
    WHERE condition = 'condition1'
    AND visit_date BETWEEN start_date AND end_date
);

?

Vamsi prabhala · Answer 2 · 2015-07-21T15:55:06+0000

with dt as (select start_date, end_date from VALID_DATES where condition = 'condition1')
select a.pers_key, a.pers_name 
from VISIT_INFO a
JOIN dt on a.visit_date between dt.start_date and dt.end_date
GROUP BY a.pers_key

Gordon linoff · Answer 3 · 2015-07-21T16:06:40+0000

Trying a version exists

is definitely a possibility. However, you might be better off not expanding the table VALID_DATES

, so there is one row in each date.

Then the request:

select vi.*
from VISIT_INFO vi JOIN
     VALID_DATES_expanded vde
     ON vi.visit_date = vde.valid_date
where vde.condition = 'condition1';

can use an index on VISIT_INFO(visit_date)

and on VALID_DATES_expanded(condition, valid_date)

. This is most likely the fastest approach to solving this problem if a VISIT_INFO

very large and relatively few rows are fetched on demand.

Filtering a large table by date

More articles: