Filtering a large table by date
I have a VISIT_INFO table with these columns:
pers_key - unique identifyer for each person
pers_name - name of person
visit_date - date at which they visited a business
And another table, VALID_DATES, with these columns:
condition - string
start_date - date
end_date - date
I currently have the following request:
select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key
So condition 1 has defined start_date and end_date. I need to filter VISIT_INFO for visits that are between two dates. I am wondering if there is a more efficient way to do this. From my current understanding, it should currently go through the whole table (millions of rows) and add start_date and end_date to each row. Then do I have to iterate over each line and check the WHERE clause?
I ask this because when I remove the cross join and hardcode start_date and end_date for condition 1, it takes significantly less time. I try to avoid hard-coding dates because it will lead to serious boredom in the future.
So, to reiterate, is there a better way to filter VISIT_INFO for specific dates in VALID_DATES?
Edit: I just realized that I left quite a lot of information as this is all in HIVE. Therefore EXISTS and joins to (a between b and c) is out of the question.
source to share
Trying a version exists
is definitely a possibility. However, you might be better off not expanding the table VALID_DATES
, so there is one row in each date.
Then the request:
select vi.*
from VISIT_INFO vi JOIN
VALID_DATES_expanded vde
ON vi.visit_date = vde.valid_date
where vde.condition = 'condition1';
can use an index on VISIT_INFO(visit_date)
and on VALID_DATES_expanded(condition, valid_date)
. This is most likely the fastest approach to solving this problem if a VISIT_INFO
very large and relatively few rows are fetched on demand.
source to share