Request for daily requests with limited dates for several weeks
I am trying to find active users over time on a daily basis.
A user is active when they have made more than 10 requests per week for 4 consecutive weeks .
i.e. on October 31, 2014, a user is active if he has made more than 10 requests in total in the week between:
- October 24 - October 30, 2014 AND
- Oct 17-Oct 23, 2014 AND
- Oct 10-Oct 16, 2014 AND
- Oct 3-Oct 9, 2014
I have a table requests
:
CREATE TABLE requests (
id text PRIMARY KEY, -- id of the request
amount bigint, -- sum of requests made by accounts_id to recipient_id,
-- aggregated on a daily basis based on "date"
accounts_id text, -- id of the user
recipient_id text, -- id of the recipient
date timestamp -- date that the request was made in YYYY-MM-DD
);
Examples of values:
INSERT INTO requests2
VALUES
('1', 19, 'a1', 'b1', '2014-10-05 00:00:00'),
('2', 19, 'a2', 'b2', '2014-10-06 00:00:00'),
('3', 85, 'a3', 'b3', '2014-10-07 00:00:00'),
('4', 11, 'a1', 'b4', '2014-10-13 00:00:00'),
('5', 2, 'a2', 'b5', '2014-10-14 00:00:00'),
('6', 50, 'a3', 'b5', '2014-10-15 00:00:00'),
('7', 787323, 'a1', 'b6', '2014-10-17 00:00:00'),
('8', 33, 'a2', 'b8', '2014-10-18 00:00:00'),
('9', 14, 'a3', 'b9', '2014-10-19 00:00:00'),
('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');
Output example:
Date | # Active users
-----------+---------------
10-01-2014 | 600
10-02-2014 | 703
10-03-2014 | 891
Here I tried to find the number of active users on a specific date (e.g. 10-01-2014):
SELECT count(*)
FROM
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
Since this is just a request to get a number for 1 day, I need to get this number daily over time. I think the idea is to make a join to get the date, so I tried to do something like this:
SELECT week_1."Date_series",
count(*)
FROM
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
AND week_1."Date_series" = week_2."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
AND week_2."Date_series" = week_3."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
AND week_3."Date_series" = week_4."Date_series"
GROUP BY week_1."Date_series"
However, I think I am not getting the correct answer and I do not know why. Any tips / advice / pointers are greatly appreciated! :) :)
PS. I am using Postgres 9.3
source to share
Here's a long answer on how to keep your queries short. :)
Table
Based on my table (before you specified the table definition with different (odd!) Data types:
CREATE TABLE requests (
id int
, accounts_id int -- (id of the user)
, recipient_id int -- (id of the recipient)
, date date -- (date that the request was made in YYYY-MM-DD)
, amount int -- (# of requests by accounts_id for the day)
);
Active users for a given day
List of "active users" for one given day :
SELECT accounts_id
FROM (
SELECT w.w, r.accounts_id
FROM (
SELECT w
, day - 6 - 7 * w AS w_start
, day - 7 * w AS w_end
FROM (SELECT '2014-10-31'::date - 1 AS day) d -- effective date here
, generate_series(0,3) w
) w
JOIN requests r ON r."date" BETWEEN w_start AND w_end
GROUP BY w.w, r.accounts_id
HAVING sum(r.amount) > 10
) sub
GROUP BY 1
HAVING count(*) = 4;
Step 1
In the innermost subqueryw
(for "week"), plot the boundaries of the four-week interest from a CROSS JOIN
given day - 1 with output generate_series(0-3)
.
To add / subtract days to / from date
(not from the timestamp!), Just add / subtract numbers integer
. The expression day - 7 * w
subtracts 0-3 times 7 days from the specified date, reaching the end dates for each week ( w_end
).
Count another 6 days (not 7!) From each to calculate the corresponding start ( w_start
).
Also, save the week number w
(0-3) for later aggregation.
Step 2
In a subquery,sub
join rows from requests
a set of 4 weeks where the date is between the start and end dates. GROUP BY
week number w
and accounts_id
.
Only weeks with more than 10 requests are cumulative.
Step 3
The externalSELECT
counts the number of weeks that each user ( accounts_id
) qualified. Must be 4 to qualify as an "active user"
Number of active users per day
It's dynamite .
Wrapped in a simple SQL function to simplify general use, but the query can be used on its own:
CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
SELECT accounts_id, date, sum(amount)::int AS amount
FROM requests
WHERE date BETWEEN _now - (27 + _days) AND _now - 1
GROUP BY accounts_id, date
)
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM (
SELECT accounts_id, date
, count(w_amount > 10 OR NULL)
OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
FROM (
SELECT accounts_id, date, dow
, sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM (SELECT _now - i AS date, i%7 AS dow
FROM generate_series(1, 27 + _days) i) d -- period of interest
CROSS JOIN (
SELECT accounts_id FROM r
GROUP BY 1
HAVING count(*) > 3 AND sum(amount) > 39 -- enough rows & requests
AND max(date) > min(date) + 15) a -- can cover 4 weeks
LEFT JOIN r USING (accounts_id, date)
) sub1
WHERE date > _now - (22 + _days) -- cut off 6 trailing days now - useful?
) sub2
GROUP BY date
ORDER BY date DESC
LIMIT _days
$func$ LANGUAGE sql STABLE;
The function accepts any day ( _now
), "today" as the default and the number of days ( _days
) in the result, 3 by default. Call:
SELECT * FROM f_active_users('2014-10-31', 5);
Or without parameters to use default values:
SELECT * FROM f_active_users();
The approach is different from the first request .
SQL Fiddle with two queries and options to define your table.
Step 0
CTEs have r
pre-aggregated amounts for the (accounts_id, date)
entire period of interest for better performance The table is scanned only once, the suggested index is indicated here (see Impact).
Step 1
In the inner subquery, d
generate the required list of days: 27 + _days
rows, where _days
is the desired number of rows in the output, effectively 28 days or more.
From there, calculate the day of the week ( dow
) that will be used for aggregation in step 3. i%7
matches weekly intervals, however, the query works for any interval.
In the inner subquery, a
generate a unique list of users ( accounts_id
) that exist in the CTE r
and will pass some of the first shallow tests (enough lines covering enough time with enough requests).
Step 2
Create a cartesian product of d
and a
s CROSS JOIN
to have one row for each relevant day for each relevant user . LEFT JOIN
before r
to add the number of requests (if any). No WHERE
, we want to get results every day, even if there are no active users.
Calculate last week's grand total ( w_amount
) in the same step using Custom Frame Window Functions . Example:
Step 3
Disable last 6 days; which is optional and may or may not help performance. Check it out:WHERE date >= _now - (21 + _days)
Count the weeks in which the minimum amount is executed ( w_ct
) in a similar window function, this time divided by dow
an additional one to have only the same days of the week for the last 4 weeks in the frame (which carry the amount corresponding to the past week). The expression count(w_amount > 10 OR NULL)
only counts rows with more than 10 queries. Detailed explanation:
Step 4
In the outer SELECT
band date
and count the people who have gone through all 4 weeks ( count(w_ct = 4 OR NULL)
). Add 1 to the date to compensate for the 1's ORDER
and LIMIT
up to the required number of days.
Performance and prospects
Ideal index for both queries:
CREATE INDEX foo ON requests (date, accounts_id, amount);
Performance should be good, but better (much better) with the upcoming Postgres 9.4 , thanks to new support for moving aggregation :
Moving-aggregate support in the Postgres Wiki.
Moving units in manual 9.4
Also: don't name the column timestamp
"date", it's a timestamp
, not date
. Better yet, never use base type names such as date
or timestamp
as an identifier. Someday.
source to share