Cohort Analysis in SQL

I'm looking to do some cohort analysis on our user base. We have two tables, "users" and "sessions", and both have a "created_at" field. I want to formulate a query that yields a 7-by-7 table of numbers (with some blanks) showing the number of users who were created on a specific day and who also have a session created y = (0..6) days ago, indicating that they returned on that day.

created_at  d2  d3  d4
today       *   *   *
today-1     49  *   *
today-2     45  30  *
today-3     47  48  18
...

      

In this case, 47 users that were created today-3 returned today-2.

Can I accomplish this in one MySQL query? I can run the queries individually, but it would be very nice to have it all in one query.

SELECT `users`.*
FROM `users`
INNER JOIN `sessions` ON `sessions`.`user_id` = `users`.`id`
WHERE `users`.`os` = 'ios'
  AND (`sessions`.`updated_at` BETWEEN '2013-01-16 08:00:00' AND '2013-01-17 08:00:00')

      

+4




3 answers


This seems like a tricky problem. Whether or not it seems difficult to you, it's never a bad idea to start with a smaller problem.

You could start, for example, with a query that returns all users (users only) that were created within the last week, that is, starting six days ago, as per your requirement:

SELECT *
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY

      

The next step could be to group the results by date and count the rows in each group:

SELECT
  created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY created_at

      

If created_at is a datetime or a timestamp, use DATE(created_at) as the grouping criterion:

SELECT
  DATE(created_at) AS created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY DATE(created_at)

      

However, you don't want absolute dates in the output, just relative ones, for example today, today - 1 day, etc. In this case, you can use DATEDIFF(), which returns the number of days between two dates, to produce (numeric) offsets from today, and group by those values:

SELECT
  DATEDIFF(CURDATE(), created_at) AS created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY DATEDIFF(CURDATE(), created_at)
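To make the day offsets concrete, here is a minimal Python sketch of the same grouping. It is only an illustration, not part of the MySQL solution: the helper day_offset, the fixed value of today, and the sample timestamps are all assumptions made up for the example. Like DATEDIFF(), it compares calendar dates only and ignores the time of day.

```python
from collections import Counter
from datetime import date, datetime


def day_offset(today: date, created_at: datetime) -> int:
    """Whole days between created_at's calendar date and today.

    Mirrors DATEDIFF(CURDATE(), created_at): only the date parts count.
    """
    return (today - created_at.date()).days


# Fixed "today" (hypothetical) so the example is reproducible.
today = date(2013, 1, 17)

# Hypothetical users.created_at values.
created = [
    datetime(2013, 1, 17, 9, 30),
    datetime(2013, 1, 16, 23, 59),
    datetime(2013, 1, 16, 0, 1),
    datetime(2013, 1, 14, 12, 0),
]

# Equivalent of GROUP BY offset with COUNT(*).
counts = Counter(day_offset(today, c) for c in created)
print(sorted(counts.items()))
```

With this sample data, both users created on 2013-01-16 land in offset 1 regardless of the time of day, which is the behaviour the grouping relies on.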

      

Your created_at column will now contain "dates" such as 0, 1, etc. up to 6. Converting them to today, today-1, etc. is trivial, and you will see it in the final query. However, we have now reached the point where we need to take one step back (or perhaps it is half a step to the right), because we don't really need to count the users, but their returns. So the actual working dataset from users that is required for now will be as follows:



SELECT
  id,
  DATEDIFF(CURDATE(), created_at) AS day_offset
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY

      

We need the user IDs to join this row set to the one to be obtained from sessions, and we need day_offset as a grouping criterion.

Moving on, a similar transformation needs to be done on the sessions table, and I won't go into details about that. Suffice it to say that the resulting query will be nearly identical to the last one, with two exceptions:

  • id is replaced by user_id;

  • DISTINCT is applied to the entire row set.

The reason for DISTINCT is to return at most one row per user and day: I understand that no matter how many sessions a user has on a particular day, you want to count them as one return. So, here's what comes out of sessions:

SELECT DISTINCT
  user_id,
  DATEDIFF(CURDATE(), created_at) AS day_offset
FROM sessions
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
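As a small illustration of what DISTINCT does here, the following Python sketch collapses several sessions of one user on one day into a single (user_id, day_offset) pair. The sample sessions and the fixed today are hypothetical; a set plays the role of DISTINCT.

```python
from datetime import date, datetime

# Fixed "today" (hypothetical) so the example is reproducible.
today = date(2013, 1, 17)

# Hypothetical (user_id, created_at) rows from sessions.
sessions = [
    (1, datetime(2013, 1, 16, 8, 0)),
    (1, datetime(2013, 1, 16, 21, 0)),  # same user, same day: one return
    (1, datetime(2013, 1, 17, 7, 0)),
    (2, datetime(2013, 1, 16, 12, 0)),
]

# A set keeps each (user_id, day_offset) pair once, like SELECT DISTINCT.
returns = {(user_id, (today - ts.date()).days) for user_id, ts in sessions}
print(sorted(returns))
```

Four session rows become three distinct returns, so user 1 is counted once for offset 1 even though they had two sessions that day.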

      

Now all that remains is to join the two derived tables, apply grouping, and use conditional aggregation to get the results you want:

SELECT
  CONCAT('today', IFNULL(CONCAT('-', NULLIF(u.DayOffset, 0)), '')) AS created_at,
  SUM(s.DayOffset = 0) AS d0,
  SUM(s.DayOffset = 1) AS d1,
  SUM(s.DayOffset = 2) AS d2,
  SUM(s.DayOffset = 3) AS d3,
  SUM(s.DayOffset = 4) AS d4,
  SUM(s.DayOffset = 5) AS d5,
  SUM(s.DayOffset = 6) AS d6
FROM (
  SELECT
    id,
    DATEDIFF(CURDATE(), created_at) AS DayOffset
  FROM users
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) u
LEFT JOIN (
  SELECT DISTINCT
    user_id,
    DATEDIFF(CURDATE(), created_at) AS DayOffset
  FROM sessions
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) s
ON u.id = s.user_id
GROUP BY u.DayOffset
;
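For anyone who wants to try the shape of this query without a MySQL server, here is a rough Python sketch that runs an equivalent query against an in-memory SQLite database. Several things are assumptions for the sake of a reproducible example: SQLite stands in for MySQL, julianday() arithmetic replaces DATEDIFF(), a bound :today parameter replaces CURDATE(), only d0..d2 are shown, and the sample rows are invented.

```python
import sqlite3

TODAY = "2013-01-17"  # fixed stand-in for CURDATE() (assumption)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE sessions (user_id INTEGER, created_at TEXT);
INSERT INTO users VALUES
  (1, '2013-01-15 10:00:00'),
  (2, '2013-01-15 18:30:00'),
  (3, '2013-01-16 09:00:00');
INSERT INTO sessions VALUES
  (1, '2013-01-15 10:05:00'),
  (1, '2013-01-16 08:00:00'),
  (1, '2013-01-16 21:00:00'),  -- second session that day, counted once
  (2, '2013-01-15 18:35:00'),
  (2, '2013-01-17 07:00:00'),
  (3, '2013-01-16 09:10:00');
""")

# Same structure as the MySQL query: two derived tables, LEFT JOIN,
# then conditional aggregation with SUM(<boolean expression>).
QUERY = """
SELECT
  u.day_offset,
  SUM(s.day_offset = 0) AS d0,
  SUM(s.day_offset = 1) AS d1,
  SUM(s.day_offset = 2) AS d2
FROM (
  SELECT id,
         CAST(julianday(:today) - julianday(date(created_at)) AS INTEGER) AS day_offset
  FROM users
  WHERE date(created_at) >= date(:today, '-6 days')
) u
LEFT JOIN (
  SELECT DISTINCT user_id,
         CAST(julianday(:today) - julianday(date(created_at)) AS INTEGER) AS day_offset
  FROM sessions
  WHERE date(created_at) >= date(:today, '-6 days')
) s ON u.id = s.user_id
GROUP BY u.day_offset
ORDER BY u.day_offset
"""

rows = conn.execute(QUERY, {"today": TODAY}).fetchall()
for row in rows:
    print(row)
```

With this sample data the two users created two days ago produce one return today, one return yesterday, and two returns on their signup day, while the duplicate session of user 1 is counted only once.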

      

I have to admit that I haven't tested or debugged this, but if need be, I will be happy to work with your sample data once you provide it. :)

+15




This answer inverts the output table that @Newy wanted, so that the cohorts are rows rather than columns, and uses absolute dates instead of relative ones.

I was looking for a query that would give me something like this:

Date        d0  d1  d2  d3  d4  d5  d6
2016-11-03  3   1   0   0   0   0   0
2016-11-04  4   2   0   1   0   0   *
2016-11-05  7   0   1   1   0   *   *
2016-11-06  7   3   1   1   *   *   *
2016-11-07  13  5   1   *   *   *   *
2016-11-08  4   0   *   *   *   *   *
2016-11-09  1   *   *   *   *   *   *

      

I was looking for the number of users who signed up on a specific date, and then how many of those users returned after 1 day, after 2 days, etc. So on 2016-11-07, 13 users signed up and had a session, then 5 of those users came back after 1 day, then one user came back after 2 days, etc.

I took the first subquery of @Andriy M's big query and modified it to give me the user's registration date, not days relative to the current date:

SELECT
    id,
    DATE(created_at) AS DayOffset
  FROM users
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY

      



Then I changed the LEFT JOIN subquery to look like this:

 SELECT DISTINCT
    sessions.user_id,
    DATEDIFF(sessions.created_at, users.created_at) AS DayOffset
    FROM sessions
    LEFT JOIN users ON (users.id = sessions.user_id)
    WHERE sessions.created_at >= CURDATE() - INTERVAL 6 DAY

      

I wanted the day offset to be relative not to the current date, as in @Andriy M's answer, but to the user's registration date. So I did a LEFT JOIN on the users table to get the time the user signed up and computed the date difference.

So the final query looks something like this:

SELECT u.DayOffset as Date,
  SUM(s.DayOffset = 0) AS d0,
  SUM(s.DayOffset = 1) AS d1,
  SUM(s.DayOffset = 2) AS d2,
  SUM(s.DayOffset = 3) AS d3,
  SUM(s.DayOffset = 4) AS d4,
  SUM(s.DayOffset = 5) AS d5,
  SUM(s.DayOffset = 6) AS d6
FROM (
 SELECT
    id,
    DATE(created_at) AS DayOffset
  FROM users
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) as u
LEFT JOIN (
    SELECT DISTINCT
    sessions.user_id,
    DATEDIFF(sessions.created_at, users.created_at) AS DayOffset
    FROM sessions
    LEFT JOIN users ON (users.id = sessions.user_id)
    WHERE sessions.created_at >= CURDATE() - INTERVAL 6 DAY
) as s
ON s.user_id = u.id
GROUP BY u.DayOffset
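To check the idea of offsets relative to the signup date on a tiny dataset, here is a Python sketch that runs a comparable query in an in-memory SQLite database. It is a simplified illustration under stated assumptions: SQLite replaces MySQL, julianday() differences replace DATEDIFF(), the six-day window filter is left out, only d0..d2 are shown, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE sessions (user_id INTEGER, created_at TEXT);
INSERT INTO users VALUES
  (1, '2016-11-07 10:00:00'),
  (2, '2016-11-07 23:00:00'),
  (3, '2016-11-08 12:00:00');
INSERT INTO sessions VALUES
  (1, '2016-11-07 10:05:00'),
  (1, '2016-11-08 09:00:00'),
  (2, '2016-11-07 23:10:00'),
  (2, '2016-11-09 01:00:00'),
  (3, '2016-11-08 12:01:00');
""")

# Rows are signup dates; columns count users seen N days after signup.
QUERY = """
SELECT
  date(u.created_at) AS signup_date,
  SUM(s.day_offset = 0) AS d0,
  SUM(s.day_offset = 1) AS d1,
  SUM(s.day_offset = 2) AS d2
FROM users u
LEFT JOIN (
  SELECT DISTINCT
    sessions.user_id,
    CAST(julianday(date(sessions.created_at))
         - julianday(date(users.created_at)) AS INTEGER) AS day_offset
  FROM sessions
  JOIN users ON users.id = sessions.user_id
) s ON s.user_id = u.id
GROUP BY date(u.created_at)
ORDER BY signup_date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Here user 2 signs up at 23:00 and returns at 01:00 two calendar days later, so the return is counted under d2: the offsets are calendar-day differences, not 24-hour periods.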

      

+1




Example of cohorts grouped by month:

First, let's create a table of individual user activity, month by month:

SELECT 
    mu.created_timestamp AS cohort
    , mu.id AS user_id
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 1 AND l.user_id = mu.id) AS m1
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 2 AND l.user_id = mu.id) AS m2
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 3 AND l.user_id = mu.id) AS m3
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 4 AND l.user_id = mu.id) AS m4
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 5 AND l.user_id = mu.id) AS m5
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 6 AND l.user_id = mu.id) AS m6
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 7 AND l.user_id = mu.id) AS m7
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 8 AND l.user_id = mu.id) AS m8
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 9 AND l.user_id = mu.id) AS m9
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 10 AND l.user_id = mu.id) AS m10
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 11 AND l.user_id = mu.id) AS m11
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 12 AND l.user_id = mu.id) AS m12
FROM user mu 
WHERE mu.created_timestamp BETWEEN '2018-01-01 00:00:00' AND '2019-12-31 23:59:59'

      

Then, on top of this table, we aggregate the individual user activity per cohort:

SELECT MONTH(c.cohort) AS cohort
       ,COUNT(c.user_id) AS signups
       ,SUM(c.m1) AS m1 
       ,SUM(c.m2) AS m2 
       ,SUM(c.m3) AS m3 
       ,SUM(c.m4) AS m4 
       ,SUM(c.m5) AS m5 
       ,SUM(c.m6) AS m6 
       ,SUM(c.m7) AS m7 
       ,SUM(c.m8) AS m8 
       ,SUM(c.m9) AS m9 
       ,SUM(c.m10) AS m10 
       ,SUM(c.m11) AS m11 
       ,SUM(c.m12) AS m12 
FROM (SELECT 
    mu.created_timestamp AS cohort
    , mu.id AS user_id
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 1 AND l.user_id = mu.id) AS m1
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 2 AND l.user_id = mu.id) AS m2
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 3 AND l.user_id = mu.id) AS m3
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 4 AND l.user_id = mu.id) AS m4
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 5 AND l.user_id = mu.id) AS m5
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 6 AND l.user_id = mu.id) AS m6
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 7 AND l.user_id = mu.id) AS m7
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 8 AND l.user_id = mu.id) AS m8
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 9 AND l.user_id = mu.id) AS m9
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 10 AND l.user_id = mu.id) AS m10
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 11 AND l.user_id = mu.id) AS m11
    ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 12 AND l.user_id = mu.id) AS m12
FROM user mu 
WHERE mu.created_timestamp BETWEEN '2018-01-01 00:00:00' AND '2019-12-31 23:59:59') AS c GROUP BY MONTH(cohort)
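The two steps (per-user month flags, then a sum per signup month) can be sketched in plain Python as follows. This is only an illustration with invented data; the names signups, orders, flags and cohorts are hypothetical. Like the queries above, it keys on the calendar month alone and ignores the year.

```python
from collections import defaultdict
from datetime import date

# Hypothetical data: user_id -> signup date, and (user_id, order_date) rows.
signups = {1: date(2018, 1, 5), 2: date(2018, 1, 20), 3: date(2018, 2, 2)}
orders = [
    (1, date(2018, 1, 6)),
    (1, date(2018, 2, 1)),
    (1, date(2018, 2, 15)),  # second order in the same month: flag stays 1
    (2, date(2018, 3, 3)),
    (3, date(2018, 2, 2)),
]

# Step 1: one row per user with twelve 0/1 flags (m1..m12).
flags = {user_id: [0] * 12 for user_id in signups}
for user_id, order_date in orders:
    flags[user_id][order_date.month - 1] = 1

# Step 2: group users by signup month and sum the flags per month.
cohorts = defaultdict(lambda: {"signups": 0, "months": [0] * 12})
for user_id, signup_date in signups.items():
    cohort = cohorts[signup_date.month]
    cohort["signups"] += 1
    cohort["months"] = [a + b for a, b in zip(cohort["months"], flags[user_id])]

for month in sorted(cohorts):
    print(month, cohorts[month]["signups"], cohorts[month]["months"])
```

Each flag is 0 or 1 per user and month, so the per-cohort sums count active users rather than orders.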

      

Instead of months, you can use days in most cases; in other cases, other groupings are used for the analysis.

0

