You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "dario-liberman (via GitHub)" <gi...@apache.org> on 2023/06/07 17:34:02 UTC

[GitHub] [pinot] dario-liberman opened a new issue, #10866: Add support for funnel analytics

dario-liberman opened a new issue, #10866:
URL: https://github.com/apache/pinot/issues/10866

   A common use-case in real-time interactive analytic applications is [funnel analysis](https://en.wikipedia.org/wiki/Funnel_analysis).
   
   Example funnel analytics products to name a few:
   - [Mixpanel](https://mixpanel.com/analysis)
   - [Amplitude](https://amplitude.com/amplitude-analytics)
   - [Datadog](https://docs.datadoghq.com/real_user_monitoring/funnel_analysis/) 
   - [Google Marketing Platform](https://marketingplatform.google.com/)
   
   At the core of these solutions is the ability to correlate user events across steps, such as going thru a checkout funnel.
   A typical metric is the count of users reaching each step in a sequential funnel, or the respective conversion rates between steps (simply the ratio of such counts from one step to the next).
   
   The proposal is to build an aggregation function able to correlate events by a given column, such as the user id, the session id, etc.
   
   In order to simplify the initial implementation, the proposal is to start with a causality based funnel rather than requiring a strict sequence in time. Most user funnels in real-life can only be exercised forward, so this can solve for the vast majority of use-cases.
   What is meant by causality is that we will count towards step N only if we counted towards step N-1 for the same correlation id, but we would accept for them to happen out of order (in fact, that is often the case in real-life distributed user event logging systems anyways).
   
   In order to simplify the solution further, we can expect that the table is partitioned by the correlation column (eg. user id).
   
   Once these assumptions are accepted, then the solution looks very similar to what is done in SEGMENT_PARTITIONED_DISTINCT_COUNT, which uses roaring bitmaps under the hood. The first step in the funnel would exactly match that, the second step in the funnel is simply the intersection between the first roaring bitmap and the second, and so on.
   
   The following aggregation function API is proposed:
   
   ```
   SELECT
      dateTrunc('day', timestamp) AS ts,
      FUNNEL_COUNT(
         STEPS(url = '/addToCart', url = '/checkout', url = '/orderConfirmation')
         CORRELATED_BY(user)
       ) as step_counts
   FROM user_log
   GROUP BY 1
   ```
   
   This query would return for each day the number of users going thru the funnel defined by the steps above as an array of longs, one count per step. An outer query or the client application (eg. a grafana plugin) can compute conversion rates if necessary.
   
   A user may optimise the query by filtering relevant steps as follows:
   ```
   SELECT
      dateTrunc('day', timestamp) AS ts,
      FUNNEL_COUNT(
         STEPS(url = '/addToCart', url = '/checkout', url = '/orderConfirmation')
         CORRELATED_BY(user_id)
       ) as step_counts
   FROM user_log
   WHERE url in ('/addToCart',  '/checkout', '/orderConfirmation')
   GROUP BY 1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on issue #10866: Add support for funnel analytics

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on issue #10866:
URL: https://github.com/apache/pinot/issues/10866#issuecomment-1581805593

   I am not sure what the following means. Can you give an example or is this idea from some well known source (then can you add a reference to it?)
   ```
   What is meant by causality is that we will count towards step N+1 only if we counted towards step N for the same correlation id, but we would accept for them to happen out of order (in fact, that is often the case in real-life distributed user event logging systems anyways).
   
   In order to simplify the solution further, we can expect that the table is partitioned by the correlation column (eg. user id).
   ```
   For partition, which Pinot partition config are you referring to exactly?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] dario-liberman commented on issue #10866: Add support for funnel analytics

Posted by "dario-liberman (via GitHub)" <gi...@apache.org>.
dario-liberman commented on issue #10866:
URL: https://github.com/apache/pinot/issues/10866#issuecomment-1583319291

   > I am not sure what the following means. 
   
   It means any temporal order will be accepted as converted.
   
   For a funnel measuring conversion of A->B->C
   1. we will count users as entering the funnel if they performed A,  
   2. converted to step 2 whenever they performed A and B, 
   3. and converted fully whenever they performed A and B and C. 
   
   A user who only performed A and C, or who performed B and C but not A, would not count as converted. 
   However, we will not require a strict temporal order between the events, so if we registered events with this temporal sequence: A,C,B - then the user would still count as converted as the user performed all the steps in the funnel.
   
   What I was saying is that in many real-life systems it happens that clocks are not synchronised, specially if having multiple sources (eg. mobile and backend events), so the simplification of counting in any order can also be a plus. 
   I was also pointing out that in most real-life scenarios there is no way to do C without doing B, for example there is no way to complete a checkout without starting a checkout in the firsts place, or without adding first an item to the cart.
   
   > Can you give an example or is this idea from some well known source (then can you add a reference to it?)
   
   Many funnel analytics solutions support "any order" conversion criteria. For example:
   - [Mixpanel Any Order](https://docs.mixpanel.com/docs/analysis/reports/funnels#ordering)
   - [Amplitude Any Order](https://help.amplitude.com/hc/en-us/articles/360052745632#h_6b432e9f-a491-439e-a2c6-005bc16bc6f6)
   
   I believe we can expand in future iterations on supporting temporal order constraints.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org