Posted to commits@pinot.apache.org by "dario-liberman (via GitHub)" <gi...@apache.org> on 2023/07/12 13:42:35 UTC

[GitHub] [pinot] dario-liberman commented on pull request #10867: FUNNEL_COUNT Aggregation Function

dario-liberman commented on PR #10867:
URL: https://github.com/apache/pinot/pull/10867#issuecomment-1632555378

   > I am working on other aggregation strategies that do not require partitioning - [master...dario-liberman:pinot:funnel-strategies](https://github.com/apache/pinot/compare/master...dario-liberman:pinot:funnel-strategies)
   
   @kishoreg @chenboat - Finally had a chance to add tests and complete the PR for the remaining aggregation strategies - https://github.com/apache/pinot/pull/11092
   
   > > > > In order for this aggregation to work, does it require all the data to be partitioned by segments (i.e. all users show up in the same segment, and no user can be shared across segments)? That is the pre-requisite for `SEGMENT_PARTITIONED_DISTINCT_COUNT`
   > > > 
   > > > 
   > > > Yes. That is the pre-requisite for using the aggregation function. For a realtime table, the Kafka topic needs to be partitioned (e.g., by user ID).
   > > 
   > > 
   > > This is probably not practical and we should consider fixing it. Even if the Kafka topic is partitioned by user_id, there is no guarantee that each user will be confined to a single segment.
   > 
   > I shared above a work-in-progress branch with more funnel count aggregation strategies, effectively equivalent to DISTINCTCOUNT, DISTINCTCOUNTBITMAP and DISTINCTCOUNTTHETASKETCH. These do not depend on partitioning.
   > 
   > The strategy equivalent to SEGMENTPARTITIONEDDISTINCTCOUNT we have here is just a first version. When the column is configured as a partition column, the same user appears in more than one segment only across time boundaries between segments, so grouping over time (e.g. per hour) to see funnel trends gives good enough approximations. In the future it should be possible to incorporate a partition-level (or server-level?) phase so that we aggregate differently between segments within the same partition and segments across partitions. I will need more time for that, though; for now I am adding different strategies so we can use the right one for each use case, as the choice will also depend on the desired sessionization window.
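
To make the trade-off in the quoted discussion concrete, here is a minimal toy sketch in Python. It is not Pinot's implementation: the function names, the event layout and the sample data are all hypothetical, and a funnel is assumed to mean "distinct users present in every step so far", ignoring event ordering. It compares summing per-segment funnel counts (the segment-partitioned idea) with merging per-step user sets across segments before intersecting (the set-based idea).

```python
from typing import List, Tuple

# Hypothetical event layout: a segment is a list of (user_id, step) pairs.
Segment = List[Tuple[str, str]]


def funnel_counts(events: Segment, steps: List[str]) -> List[int]:
    """Count distinct users present in step 1, in steps 1 and 2, and so on."""
    users_by_step = {step: set() for step in steps}
    for user, step in events:
        if step in users_by_step:
            users_by_step[step].add(user)
    counts = []
    reached = None
    for step in steps:
        # Intersect with the users who made it through all previous steps.
        reached = users_by_step[step] if reached is None else reached & users_by_step[step]
        counts.append(len(reached))
    return counts


def funnel_partitioned(segments: List[Segment], steps: List[str]) -> List[int]:
    """Sum per-segment funnel counts; exact only if no user spans segments."""
    totals = [0] * len(steps)
    for segment in segments:
        for i, count in enumerate(funnel_counts(segment, steps)):
            totals[i] += count
    return totals


def funnel_set_based(segments: List[Segment], steps: List[str]) -> List[int]:
    """Merge events from all segments first, then intersect per-step user sets."""
    merged = [event for segment in segments for event in segment]
    return funnel_counts(merged, steps)


steps = ["view", "cart", "buy"]
segments = [
    [("u1", "view"), ("u1", "cart"), ("u2", "view")],
    # u1 completes the funnel here and u2 reappears: both users span segments.
    [("u1", "buy"), ("u2", "view"), ("u3", "view"), ("u3", "cart"), ("u3", "buy")],
]

partitioned = funnel_partitioned(segments, steps)  # [4, 2, 1]
set_based = funnel_set_based(segments, steps)      # [3, 2, 2]
```

In this toy data the partitioned sum double-counts `u2` at the first step and misses `u1` at the last step, because `u1`'s funnel is split across the two segments. When every user stays within one segment, the two functions return the same counts.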
   
   Hopefully these aggregation strategies address the concerns here?
   
   Regarding the other concern about sub-function arguments not being real transform functions, that will be addressed in a separate PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

