Request for daily requests with limited dates for several weeks

I am trying to find active users over time on a daily basis.

A user is active when they have made more than 10 requests per week for 4 consecutive weeks .

i.e. on October 31, 2014, a user is active if he has made more than 10 requests in total in the week between:

  • October 24 - October 30, 2014 AND
  • Oct 17-Oct 23, 2014 AND
  • Oct 10-Oct 16, 2014 AND
  • Oct 3-Oct 9, 2014

I have a table requests

:

CREATE TABLE requests (
  id text PRIMARY KEY, -- id of the request
  amount bigint,       -- sum of requests made by accounts_id to recipient_id,
                       -- aggregated on a daily basis based on "date"
  accounts_id text,    -- id of the user
  recipient_id text,   -- id of the recipient
  date timestamp       -- date that the request was made in YYYY-MM-DD
);

      

Examples of values:

INSERT INTO requests2
VALUES
    ('1',  19, 'a1', 'b1', '2014-10-05 00:00:00'),
    ('2',  19, 'a2', 'b2', '2014-10-06 00:00:00'),
    ('3',  85, 'a3', 'b3', '2014-10-07 00:00:00'),
    ('4',  11, 'a1', 'b4', '2014-10-13 00:00:00'),
    ('5',  2,  'a2', 'b5', '2014-10-14 00:00:00'),
    ('6',  50, 'a3', 'b5', '2014-10-15 00:00:00'),
    ('7',  787323, 'a1', 'b6', '2014-10-17 00:00:00'),
    ('8',  33, 'a2', 'b8', '2014-10-18 00:00:00'),
    ('9',  14, 'a3', 'b9', '2014-10-19 00:00:00'),
    ('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
    ('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
    ('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');

      

Output example:

Date       | # Active users
-----------+---------------
10-01-2014 | 600
10-02-2014 | 703
10-03-2014 | 891

      

Here I tried to find the number of active users on a specific date (e.g. 10-01-2014):

SELECT count(*)
FROM
  (SELECT accounts_id
   FROM requests
   WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
   GROUP BY accounts_id HAVING sum(amount) > 10) week_1
JOIN
  (SELECT accounts_id
   FROM requests
   WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
   GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
JOIN
  (SELECT accounts_id
   FROM requests
   WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
   GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
JOIN
  (SELECT accounts_id
   FROM requests
   WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
   GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id

      

Since this is just a request to get a number for 1 day, I need to get this number daily over time. I think the idea is to make a join to get the date, so I tried to do something like this:

SELECT week_1."Date_series",
       count(*)
FROM
  (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
          accounts_id
   FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
   WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
   GROUP BY "Date_series",
            accounts_id HAVING sum(amount) > 10) week_1
JOIN
  (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
          accounts_id
   FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
   WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
   GROUP BY "Date_series",
            accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
AND week_1."Date_series" = week_2."Date_series"
JOIN
  (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
          accounts_id
   FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
   WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
   GROUP BY "Date_series",
            accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
AND week_2."Date_series" = week_3."Date_series"
JOIN
  (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
          accounts_id
   FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
   WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
   GROUP BY "Date_series",
            accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
AND week_3."Date_series" = week_4."Date_series"
GROUP BY week_1."Date_series"

      

However, I think I am not getting the correct answer and I do not know why. Any tips / advice / pointers are greatly appreciated! :) :)

PS. I am using Postgres 9.3

+3


source to share


1 answer


Here's a long answer on how to keep your queries short. :)

Table

Based on my table (before you specified the table definition with different (odd!) Data types:

CREATE TABLE requests (
   id           int
 , accounts_id  int  -- (id of the user)
 , recipient_id int  -- (id of the recipient)
 , date         date -- (date that the request was made in YYYY-MM-DD)
 , amount       int  -- (# of requests by accounts_id for the day)
);

      

Active users for a given day

List of "active users" for one given day :

SELECT accounts_id
FROM  (
   SELECT w.w, r.accounts_id
   FROM  (
      SELECT w
           , day - 6 - 7 * w AS w_start
           , day     - 7 * w AS w_end   
      FROM  (SELECT '2014-10-31'::date - 1 AS day) d  -- effective date here
           , generate_series(0,3) w
      ) w
   JOIN   requests r ON r."date" BETWEEN w_start AND w_end
   GROUP  BY w.w, r.accounts_id
   HAVING sum(r.amount) > 10
   ) sub
GROUP  BY 1
HAVING count(*) = 4;

      

Step 1

In the innermost subqueryw

(for "week"), plot the boundaries of the four-week interest from a CROSS JOIN

given day - 1 with output generate_series(0-3)

.

To add / subtract days to / from date

(not from the timestamp!), Just add / subtract numbers integer

. The expression day - 7 * w

subtracts 0-3 times 7 days from the specified date, reaching the end dates for each week ( w_end

).
Count another 6 days (not 7!) From each to calculate the corresponding start ( w_start

).
Also, save the week number w

(0-3) for later aggregation.

Step 2

In a subquery,sub

join rows from requests

a set of 4 weeks where the date is between the start and end dates. GROUP BY

week number w

and accounts_id

.
Only weeks with more than 10 requests are cumulative.

Step 3

The externalSELECT

counts the number of weeks that each user ( accounts_id

) qualified. Must be 4 to qualify as an "active user"

Number of active users per day

It's dynamite .
Wrapped in a simple SQL function to simplify general use, but the query can be used on its own:

CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
  RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
   SELECT accounts_id, date, sum(amount)::int AS amount
   FROM   requests
   WHERE  date BETWEEN _now - (27 + _days) AND _now - 1
   GROUP  BY accounts_id, date
   )
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM  (
   SELECT accounts_id, date
        , count(w_amount > 10 OR NULL)
                         OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
                         ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
   FROM  (
      SELECT accounts_id, date, dow   
           , sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
                         ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
      FROM  (SELECT _now - i AS date, i%7 AS dow
             FROM   generate_series(1, 27 + _days) i) d -- period of interest
      CROSS  JOIN (
             SELECT accounts_id FROM r
             GROUP  BY 1
             HAVING count(*) > 3 AND sum(amount) > 39  -- enough rows & requests
             AND    max(date) > min(date) + 15) a      -- can cover 4 weeks
      LEFT   JOIN r USING (accounts_id, date)
      ) sub1
   WHERE date > _now - (22 + _days)  -- cut off 6 trailing days now - useful?
   ) sub2
GROUP  BY date
ORDER  BY date DESC
LIMIT  _days
$func$ LANGUAGE sql STABLE;

      

The function accepts any day ( _now

), "today" as the default and the number of days ( _days

) in the result, 3 by default. Call:

SELECT * FROM f_active_users('2014-10-31', 5);

      

Or without parameters to use default values:

SELECT * FROM f_active_users();

      

The approach is different from the first request .



SQL Fiddle with two queries and options to define your table.

Step 0

CTEs have r

pre-aggregated amounts for the (accounts_id, date)

entire period of interest for better performance The table is scanned only once, the suggested index is indicated here (see Impact).

Step 1

In the inner subquery, d

generate the required list of days: 27 + _days

rows, where _days

is the desired number of rows in the output, effectively 28 days or more.
From there, calculate the day of the week ( dow

) that will be used for aggregation in step 3. i%7

matches weekly intervals, however, the query works for any interval.

In the inner subquery, a

generate a unique list of users ( accounts_id

) that exist in the CTE r

and will pass some of the first shallow tests (enough lines covering enough time with enough requests).

Step 2

Create a cartesian product of d

and a

s CROSS JOIN

to have one row for each relevant day for each relevant user . LEFT JOIN

before r

to add the number of requests (if any). No WHERE

, we want to get results every day, even if there are no active users.

Calculate last week's grand total ( w_amount

) in the same step using Custom Frame Window Functions . Example:

Step 3

Disable last 6 days; which is optional and may or may not help performance. Check it out:WHERE date >= _now - (21 + _days)

Count the weeks in which the minimum amount is executed ( w_ct

) in a similar window function, this time divided by dow

an additional one to have only the same days of the week for the last 4 weeks in the frame (which carry the amount corresponding to the past week). The expression count(w_amount > 10 OR NULL)

only counts rows with more than 10 queries. Detailed explanation:

Step 4

In the outer SELECT

band date

and count the people who have gone through all 4 weeks ( count(w_ct = 4 OR NULL)

). Add 1 to the date to compensate for the 1's ORDER

and LIMIT

up to the required number of days.

Performance and prospects

Ideal index for both queries:

CREATE INDEX foo ON requests (date, accounts_id, amount);

      

Performance should be good, but better (much better) with the upcoming Postgres 9.4 , thanks to new support for moving aggregation :

Moving-aggregate support in the Postgres Wiki.
Moving units in manual 9.4

Also: don't name the column timestamp

"date", it's a timestamp

, not date

. Better yet, never use base type names such as date

or timestamp

as an identifier. Someday.

+6


source







All Articles