Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/12/12 17:18:01 UTC

[GitHub] [pinot] t0mpere opened a new issue, #9971: Minion HLL Merge Rollup aggregation function

t0mpere opened a new issue, #9971:
URL: https://github.com/apache/pinot/issues/9971

   Hello everybody, my name is Tommaso and I'm a Data Platform Engineer at Bumble Inc.
   I would like to contribute to Apache Pinot by implementing an HLL merge function as an aggregation option for the Minion `MergeRollupTask`.
   
   ### What needs to be done?
   Currently the Minion `MergeRollupTask` only supports simple aggregation operations such as **Min, Max, Sum**.
   I would like to add `HLLMerge` as an aggregation function, so that serialised HLL estimator objects can be merged when performing a rollup.
   
   ### Use cases
   At Bumble we rely on HLL for cardinality estimation at large scale, and it would be of great value if Apache Pinot could roll up segments that store serialised HLL counters in `BYTES` columns. This would let us keep different time granularities in a single table, allowing longer retention and faster queries over large time windows that use HLL cardinality estimation.
   
   ### Initial Proposal
   
   The aggregation function will be a class implementing the `org.apache.pinot.core.segment.processing.aggregator.ValueAggregator` interface. It will only support `BYTES` metric columns that hold serialised `com.clearspring.analytics.stream.cardinality.HyperLogLog` objects, and it will merge two existing HyperLogLog objects into a single resulting HyperLogLog object.
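
   To make the merge step concrete, here is a minimal, self-contained sketch. It is not the proposed Pinot class and does not use the clearspring library or its serialisation format: `BytesAggregator` and `RegisterMaxMerge` are hypothetical names, and each `byte[]` is assumed to be a bare array of HLL registers. It only illustrates the underlying idea that two HLL sketches built with the same register count can be merged by taking the larger value in each register.

   ```java
   import java.util.Arrays;

   public class HllMergeSketch {

       /** Hypothetical two-argument aggregator over BYTES values (not Pinot's interface). */
       public interface BytesAggregator {
           byte[] aggregate(byte[] left, byte[] right);
       }

       /** Merges two register arrays by keeping the larger value in each position. */
       public static final class RegisterMaxMerge implements BytesAggregator {
           @Override
           public byte[] aggregate(byte[] left, byte[] right) {
               // Assumption: both sketches were built with the same register count.
               if (left.length != right.length) {
                   throw new IllegalArgumentException(
                           "register counts differ: " + left.length + " vs " + right.length);
               }
               byte[] merged = new byte[left.length];
               for (int i = 0; i < left.length; i++) {
                   // HLL register values are small and non-negative, so a signed byte is enough.
                   merged[i] = (byte) Math.max(left[i], right[i]);
               }
               return merged;
           }
       }

       public static void main(String[] args) {
           // Toy register arrays standing in for two rows that share the same dimensions.
           byte[] hourOne = {0, 3, 1, 5};
           byte[] hourTwo = {2, 1, 1, 7};
           byte[] rolledUp = new RegisterMaxMerge().aggregate(hourOne, hourTwo);
           System.out.println(Arrays.toString(rolledUp));
       }
   }
   ```

   Because a register-wise maximum is commutative, associative and idempotent, the order in which rows are merged during a rollup should not change the result. A real implementation would deserialise the `BYTES` value into a HyperLogLog object, merge, and serialise the result back.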
   
   ### Limitations and Questions
   
   It could be useful to roll up columns of types like `STRING`, `INT`, `DOUBLE`, `LONG`, `MV` or `FLOAT` into a `BYTES` column holding an `HLL`. This kind of rollup is what we already do at Bumble with Apache Spark before ingesting data into Pinot. It would let users ingest **raw** data and leave the rollup to the minions, with the option of having an HLL distinct count as a metric.
   
   I'm afraid that this would not be possible, because `MergeRollupTask` can't change the type of a column to `BYTES` to store the HLL estimator. Has anyone ever thought about this?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] t0mpere closed issue #9971: Minion HLL Merge Rollup aggregation function

Posted by "t0mpere (via GitHub)" <gi...@apache.org>.
t0mpere closed issue #9971: Minion HLL Merge Rollup aggregation function
URL: https://github.com/apache/pinot/issues/9971




[GitHub] [pinot] t0mpere commented on issue #9971: Minion HLL Merge Rollup aggregation function

Posted by "t0mpere (via GitHub)" <gi...@apache.org>.
t0mpere commented on issue #9971:
URL: https://github.com/apache/pinot/issues/9971#issuecomment-1511714710

   This functionality was implemented in https://github.com/apache/pinot/pull/10328 :)

