Sequence Query with Amazon Redshift / PostgreSQL

I'm trying to compute a funnel from event data in Redshift and am having difficulty finding an efficient query for it.

For example, in Redshift I have:

timestamp          action        user id
---------          ------        -------
2015-05-05 12:00   homepage      1
2015-05-05 12:01   product page  1
2015-05-05 12:02   homepage      2
2015-05-05 12:03   checkout      1

      

I would like to extract the funnel statistics. For example:

homepage_count  product_page_count  checkout_count
--------------  ------------------  --------------
100             50                  25

      

Here homepage_count is the number of distinct users who visited the homepage, product_page_count is the number of distinct users who visited the product page after visiting the homepage, and checkout_count is the number of users who checked out after visiting the homepage and then the product page.

What would be the best query to achieve this with Amazon Redshift? Can it be done in a single query?

+3




3 answers


I think the best method might be to compute, for each user, the timestamp of the first visit of each type, and then use those timestamps in the aggregation logic:



select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
       sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
       sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (-- first time each user performed each action; NULL if the user never did
      -- "timestamp" is quoted because it is also a type keyword
      select userid,
             min(case when action = 'homepage' then "timestamp" end) as ts_homepage,
             min(case when action = 'product page' then "timestamp" end) as ts_productpage,
             min(case when action = 'checkout' then "timestamp" end) as ts_checkout
      from events  -- "events" is a placeholder; substitute your event table
      group by userid
     ) t
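
To check the logic, here is a minimal self-contained sketch that runs the same idea against the four sample rows from the question. The CTE names (events, first_visits) and the column names ts and user_id are assumptions made for this example, not the asker's real schema.

```sql
-- Sample rows copied from the question; names are illustrative only.
with events as (
    select timestamp '2015-05-05 12:00' as ts, 'homepage' as action, 1 as user_id
    union all select timestamp '2015-05-05 12:01', 'product page', 1
    union all select timestamp '2015-05-05 12:02', 'homepage', 2
    union all select timestamp '2015-05-05 12:03', 'checkout', 1
),
-- One row per user with the first time each action occurred (NULL if never).
first_visits as (
    select user_id,
           min(case when action = 'homepage' then ts end) as ts_homepage,
           min(case when action = 'product page' then ts end) as ts_productpage,
           min(case when action = 'checkout' then ts end) as ts_checkout
    from events
    group by user_id
)
-- A comparison against NULL is not true, so users missing a step fall to "else 0".
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
       sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
       sum(case when ts_checkout > ts_productpage
                 and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from first_visits;
-- For the sample rows: homepage_count = 2, productpage_count = 1, checkout_count = 1
```

User 1 passes all three steps in order, while user 2 only reaches the homepage, which is why the counts come out as 2, 1, 1.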

      

+4




The answer above is correct. I adapted it for people using AWS Mobile Analytics with Redshift.



select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
       sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
       sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select client_id,
             min(case when event_type = 'App Launch' then event_timestamp end) as ts_homepage,
             min(case when event_type = 'SignUp Success' then event_timestamp end) as ts_productpage,
             min(case when event_type = 'Start Quiz' then event_timestamp end) as ts_checkout
      from awsma.v_event
      group by client_id
     ) ts;

      

0




In case a more accurate model is needed: consider a user who opens the product page twice, the first time before the homepage and the second time after it. This case should usually also count as a conversion.

Redshift SQL query:

SELECT
COUNT(
 DISTINCT CASE WHEN cur_homepage_time IS NOT NULL
 THEN user_id END
) Step1,
COUNT(
 -- the latest homepage visit precedes the latest product page visit
 DISTINCT CASE WHEN cur_homepage_time < cur_productpage_time
 THEN user_id END
) Step2,
COUNT(
 -- latest homepage, then latest product page, then latest checkout
 DISTINCT CASE WHEN
   cur_homepage_time < cur_productpage_time AND cur_productpage_time < cur_checkout_time
   THEN user_id END
) Step3
FROM (
   -- carry the most recent time of each event forward onto every later row
   SELECT
     user_id,
     "timestamp",
     COALESCE(homepage_time,
              LAG(homepage_time) IGNORE NULLS OVER(PARTITION BY user_id
              ORDER BY "timestamp")
     ) cur_homepage_time,
     COALESCE(productpage_time,
              LAG(productpage_time) IGNORE NULLS OVER(PARTITION BY user_id
              ORDER BY "timestamp")
     ) cur_productpage_time,
     COALESCE(checkout_time,
              LAG(checkout_time) IGNORE NULLS OVER(PARTITION BY user_id
              ORDER BY "timestamp")
     ) cur_checkout_time
   FROM
     (
       -- one column per funnel step, set only on rows of that event type
       SELECT
         "timestamp",
         user_id,
         (CASE WHEN event = 'homepage'
           THEN "timestamp" END) homepage_time,
         (CASE WHEN event = 'product page'
           THEN "timestamp" END) productpage_time,
         (CASE WHEN event = 'checkout'
           THEN "timestamp" END) checkout_time
       FROM events
       WHERE "timestamp" > '2016-05-01' AND "timestamp" < '2017-01-01'
     ) event_times
 ) event_windows

      

This query fills cur_homepage_time, cur_productpage_time and cur_checkout_time on every row with the most recent timestamp at which each event occurred, up to and including that row. So if an event has already happened by the time of a given row, the corresponding column on that row is not NULL.
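
As an illustration of that fill-forward step, here is a small sketch for a single user who opens the product page both before and after the homepage. It swaps in a different technique from the query above: a running MAX window instead of LAG ... IGNORE NULLS, which gives the same "most recent timestamp so far" because the rows are ordered by time, and which should also work on engines without IGNORE NULLS (many PostgreSQL versions lack it). The sample rows and the names events, filled and ts are assumptions made for this example.

```sql
-- Hypothetical sample: product page, then homepage, then product page again.
with events as (
    select 1 as user_id, timestamp '2016-06-01 10:00' as ts, 'product page' as event
    union all select 1, timestamp '2016-06-01 10:05', 'homepage'
    union all select 1, timestamp '2016-06-01 10:10', 'product page'
),
-- Running MAX over rows up to the current one = latest occurrence so far.
filled as (
    select user_id, ts, event,
           max(case when event = 'homepage' then ts end)
               over (partition by user_id order by ts
                     rows between unbounded preceding and current row) as cur_homepage_time,
           max(case when event = 'product page' then ts end)
               over (partition by user_id order by ts
                     rows between unbounded preceding and current row) as cur_productpage_time
    from events
)
select ts, event, cur_homepage_time, cur_productpage_time
from filled
order by user_id, ts;
-- ts     event          cur_homepage_time  cur_productpage_time
-- 10:00  product page   NULL               10:00
-- 10:05  homepage       10:05              10:00
-- 10:10  product page   10:05              10:10
```

On the last row cur_homepage_time is earlier than cur_productpage_time, so a row-level comparison of the two columns recognizes the second product page visit as following the homepage. An approach based only on each user's first visit of each type would see the first product page visit (10:00) before the first homepage visit (10:05) and miss this user.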


0

