Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/13 15:42:59 UTC

[GitHub] [arrow-datafusion] jychen7 opened a new issue #2004: feat: ApproxPercentileCont supports sketches as input

jychen7 opened a new issue #2004:
URL: https://github.com/apache/arrow-datafusion/issues/2004


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Currently, `approx_quantile(column, quantile)` accepts only raw data as input and builds sketches at query time.
   When queries must meet a low-latency SLO, a common approach is to pre-aggregate sketches at ingestion time (e.g. Spark/Flink -> DataStore) and then merge them at query time (e.g. DataStore -> Datafusion).
   
   [Here](https://github.com/apache/druid/blob/master/extensions-contrib/tdigestsketch/src/test/resources/doubles_sketch_data.tsv) is an example from Druid, which also uses the TDigest algorithm. The data contains the columns `["timestamp", "product", "sketch"]`, and the sketch is encoded using [TDigest Verbose mode](https://github.com/tdunning/t-digest/blob/5db477108a6a56cb385776d9aa1ce2e0fbd60230/core/src/main/java/com/tdunning/math/stats/MergingDigest.java#L869-L880).
   
   **Describe the solution you'd like**
   Extend `approx_quantile(column, quantile)` to accept an optional third parameter, e.g. `approx_quantile(column, quantile, format)`, where `format` can be:
   - raw (default)
   - tdigest-verbose
   - tdigest-small
   - etc. (for future sketch algorithms, e.g. DDSketch from Datadog)
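
As a rough illustration of the proposed `format` argument, the sketch below parses the format names listed above into an enum and falls back to `raw` when the argument is omitted. `SketchFormat` and `parse_format` are hypothetical names made up for this example; nothing here is existing Datafusion API.

```rust
use std::str::FromStr;

/// Hypothetical set of input formats for the proposed third argument.
/// The variants mirror the names listed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchFormat {
    /// Raw column values; the sketch is built at query time (current behaviour).
    Raw,
    /// Pre-aggregated sketch in the TDigest verbose encoding.
    TDigestVerbose,
    /// Pre-aggregated sketch in the TDigest small encoding.
    TDigestSmall,
}

impl FromStr for SketchFormat {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "raw" => Ok(SketchFormat::Raw),
            "tdigest-verbose" => Ok(SketchFormat::TDigestVerbose),
            "tdigest-small" => Ok(SketchFormat::TDigestSmall),
            other => Err(format!("unsupported sketch format: {other}")),
        }
    }
}

/// Resolve the optional third argument; a missing argument means `raw`.
pub fn parse_format(arg: Option<&str>) -> Result<SketchFormat, String> {
    match arg {
        Some(s) => s.parse(),
        None => Ok(SketchFormat::Raw),
    }
}
```

With this shape, a future algorithm such as DDSketch would only need a new variant and a new match arm; until then, an unknown name is rejected with an error rather than silently treated as raw data.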
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features you've considered.
   
   **Additional context**
   
   In the TDigest implementation of Datafusion, there is an encoding/serialization used internally.
   
   https://github.com/apache/arrow-datafusion/blob/ca952bd33402816dbb1550debb9b8cac3b13e8f2/datafusion-physical-expr/src/tdigest/mod.rs#L571-L582
   
   This encoding differs slightly from the Java one (written by the algorithm's author):
   
   Datafusion: max_size, sum, count, max, min, centroid (mean, weight)
   Java: encoding_version, min, max, max_size, count, centroid (weight, mean)
   
   Question: should we modify Datafusion's internal state encoding to align with the Java one, or just map the Java encoding to Datafusion's at query time?
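
To make the second option concrete, here is a minimal sketch of such a mapping, working on already-decoded fields rather than raw bytes. `JavaDigest`, `DatafusionState` and `java_to_datafusion` are hypothetical names, and the field types are guesses. The sketch also rests on two assumptions that the field lists above do not confirm: that `count` means the same thing on both sides, and that `sum` (which the Java encoding does not carry) can be approximated as the sum of `mean * weight` over the centroids.

```rust
/// Decoded fields in the order listed above for the Java encoding.
/// Hypothetical struct; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct JavaDigest {
    pub encoding_version: i32,
    pub min: f64,
    pub max: f64,
    pub max_size: f64,
    pub count: u64,
    /// Centroids stored as (weight, mean).
    pub centroids: Vec<(f64, f64)>,
}

/// Fields in the order listed above for the Datafusion state.
/// Hypothetical struct; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct DatafusionState {
    pub max_size: usize,
    pub sum: f64,
    pub count: f64,
    pub max: f64,
    pub min: f64,
    /// Centroids stored as (mean, weight).
    pub centroids: Vec<(f64, f64)>,
}

/// Reorder Java-style fields into the Datafusion layout.
pub fn java_to_datafusion(java: &JavaDigest) -> DatafusionState {
    // Swap each (weight, mean) pair into (mean, weight).
    let centroids: Vec<(f64, f64)> = java
        .centroids
        .iter()
        .map(|&(weight, mean)| (mean, weight))
        .collect();

    // Assumption: the Java side has no `sum`, so approximate it
    // from the centroids as the sum of mean * weight.
    let sum: f64 = centroids.iter().map(|&(mean, weight)| mean * weight).sum();

    DatafusionState {
        max_size: java.max_size as usize,
        sum,
        // Assumption: `count` has the same meaning in both encodings.
        count: java.count as f64,
        max: java.max,
        min: java.min,
        centroids,
    }
}
```

The `encoding_version` field is dropped in this direction, so a real mapping would need to check it before decoding.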


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #2004: feat: ApproxPercentileCont supports sketches from data source

Posted by GitBox <gi...@apache.org>.

alamb closed issue #2004:
URL: https://github.com/apache/arrow-datafusion/issues/2004


   



[GitHub] [arrow-datafusion] jychen7 commented on issue #2004: feat: ApproxPercentileCont supports sketches from data source

Posted by GitBox <gi...@apache.org>.

jychen7 commented on issue #2004:
URL: https://github.com/apache/arrow-datafusion/issues/2004#issuecomment-1070061477


   > Improve approx_quantile(column, quantile) to accept an optional 3rd params, e.g. approx_quantile(column, quantile, format)
   
   I have implemented `approx_percentile_cont_from_sketch` in my fork ([approx_percentile_cont_from_sketch.rs](https://github.com/jychen7/arrow-datafusion/blob/2004-tdigest-sketches-from-data-source/datafusion-physical-expr/src/expressions/approx_percentile_cont_from_sketch.rs)), with a test case in [tests/sql/aggregates.rs](https://github.com/jychen7/arrow-datafusion/blob/4ad1f0b2f1965f70aaaffccf0a37e33f989fccb7/datafusion/tests/sql/aggregates.rs#L505) and a [sketch CSV file](https://github.com/jychen7/arrow-datafusion/blob/2004-tdigest-sketches-from-data-source/sketch-testing/data/tdigest_sketch.csv) similar to Druid's, but I don't find it elegant enough.
   
   So I want to try another approach: introducing `approx_percentile_cont_with_weight(column, weight_column, percentile)`, similar to [Trino](https://trino.io/docs/current/functions/aggregate.html)'s `approx_percentile(x, w, percentage)`.
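
To illustrate the semantics a weighted variant would approximate, here is a small exact weighted percentile over `(value, weight)` rows, such as the `(mean, weight)` centroids of a pre-aggregated sketch. It is not a t-digest and not Datafusion code: `weighted_percentile` is a hypothetical helper, and the rule of returning the first value whose cumulative weight reaches `percentile * total_weight` is one simple convention chosen for this example.

```rust
/// Exact weighted percentile over (value, weight) rows.
/// Returns None for an out-of-range percentile or when no row has positive weight.
pub fn weighted_percentile(rows: &[(f64, f64)], percentile: f64) -> Option<f64> {
    if !(0.0..=1.0).contains(&percentile) {
        return None;
    }

    // Keep only rows with a usable value and a positive weight.
    let mut sorted: Vec<(f64, f64)> = rows
        .iter()
        .copied()
        .filter(|&(value, weight)| !value.is_nan() && weight > 0.0)
        .collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    let total: f64 = sorted.iter().map(|&(_, weight)| weight).sum();
    let target = percentile * total;

    // Walk the values in order until the cumulative weight reaches the target.
    let mut cumulative = 0.0;
    for &(value, weight) in &sorted {
        cumulative += weight;
        if cumulative >= target {
            return Some(value);
        }
    }
    // Floating-point rounding can leave the target just out of reach.
    sorted.last().map(|&(value, _)| value)
}
```

For example, the rows `(10.0, 1.0)`, `(20.0, 3.0)`, `(30.0, 1.0)` behave like the five values 10, 20, 20, 20, 30, so the median is 20. This is why a weight column is enough to feed sketch centroids into the aggregate without a dedicated sketch format.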

