Cohort Analysis with Amazon Redshift / PostgreSQL
I am trying to analyze user retention using cohort analysis based on event data stored in Redshift.
For example, in Redshift I have:
timestamp         action        user_id
----------------  ------------  -------
2015-05-05 12:00  homepage      1
2015-05-05 12:01  product page  1
2015-05-05 12:02  homepage      2
2015-05-05 12:03  checkout      1
I would like to extract the daily cohort. For example:
signup_day users_count d1 d2 d3 d4 d5 d6 d7
---------- ----------- -- -- -- -- -- -- --
2015-05-05 100 80 60 40 20 17 16 12
2015-05-06 150 120 90 60 30 22 18 15
Where signup_day is the first date on which we have a record of the user's actions, users_count is the total number of users who signed up on signup_day, d1 is the number of those users who performed an action one day after signup_day, and so on.
Is there a better way to represent retention analysis data?
What would be the best query to achieve this with Amazon Redshift? Can it be done in a single query?
I eventually found that the query below satisfies my requirements.
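WITH
-- One row per user: the day of the user's first recorded event (the signup day).
-- "table" is a placeholder for the events table; it and "timestamp" are quoted
-- because both can clash with SQL keywords.
users AS (
    SELECT user_id,
           DATE_TRUNC('day', MIN("timestamp")) AS activated_at
    FROM "table"
    GROUP BY 1
),
-- Every recorded event with the time it occurred.
events AS (
    SELECT user_id,
           action,
           "timestamp" AS occurred_at
    FROM "table"
)
SELECT DATE_TRUNC('day', u.activated_at) AS signup_date,
       -- Whole days elapsed between the signup day and the event.
       TRUNC(EXTRACT('EPOCH' FROM e.occurred_at - u.activated_at) / (3600 * 24)) AS user_period,
       COUNT(DISTINCT e.user_id) AS retained_users
FROM users u
JOIN events e
  ON e.user_id = u.user_id
 AND e.occurred_at >= u.activated_at
-- Only cohorts that signed up within the last 11 days.
WHERE u.activated_at >= GETDATE() - INTERVAL '11 day'
GROUP BY 1, 2
ORDER BY 1, 2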
It produces a slightly different table than I described above (but better for my needs):
signup_date user_period retained_users
----------- ----------- --------------
2015-05-05 0 80
2015-05-05 1 60
2015-05-05 2 40
2015-05-05 3 20
2015-05-06 0 100
2015-05-06 1 80
2015-05-06 2 40
2015-05-06 3 20
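If the wide layout from the question (one row per signup day with d1 to d7 columns) is still wanted, it can probably be produced in a single query with conditional aggregation. The following is only a sketch under a few assumptions: "table" is the same placeholder for the events table, the query targets Redshift (DATEDIFF with a day datepart is not available in plain PostgreSQL), and the activity CTE name is made up for illustration. Because each user's first event falls on the signup day, every user has a row with user_period = 0, which is what users_count relies on.

```sql
WITH users AS (
    -- Signup day = day of the user's first recorded event.
    SELECT user_id,
           DATE_TRUNC('day', MIN("timestamp")) AS activated_at
    FROM "table"   -- placeholder table name (assumption)
    GROUP BY 1
),
activity AS (
    -- One row per (user, days since signup) on which the user was active.
    SELECT DISTINCT
           u.user_id,
           u.activated_at AS signup_day,
           DATEDIFF(day, u.activated_at, e."timestamp") AS user_period
    FROM users u
    JOIN "table" e
      ON e.user_id = u.user_id
)
SELECT signup_day,
       -- Rows are unique per user and period, so plain sums count users.
       SUM(CASE WHEN user_period = 0 THEN 1 ELSE 0 END) AS users_count,
       SUM(CASE WHEN user_period = 1 THEN 1 ELSE 0 END) AS d1,
       SUM(CASE WHEN user_period = 2 THEN 1 ELSE 0 END) AS d2,
       SUM(CASE WHEN user_period = 3 THEN 1 ELSE 0 END) AS d3,
       SUM(CASE WHEN user_period = 4 THEN 1 ELSE 0 END) AS d4,
       SUM(CASE WHEN user_period = 5 THEN 1 ELSE 0 END) AS d5,
       SUM(CASE WHEN user_period = 6 THEN 1 ELSE 0 END) AS d6,
       SUM(CASE WHEN user_period = 7 THEN 1 ELSE 0 END) AS d7
FROM activity
GROUP BY 1
ORDER BY 1;
```

The long format above is easier to extend to more periods, since the wide form needs one hand-written column per day.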