Posted to dev@parquet.apache.org by "Mukul Sabharwal (Jira)" <ji...@apache.org> on 2020/09/25 17:46:00 UTC
[jira] [Comment Edited] (PARQUET-42) Add HyperLogLog / CountMinSketch to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202319#comment-17202319 ]
Mukul Sabharwal edited comment on PARQUET-42 at 9/25/20, 5:45 PM:
------------------------------------------------------------------
It would be nice to standardize it. TDigest would also be very useful for quantile estimation. It is commutative and associative as well.
was (Author: mjsabby):
It would be nice standardize it. TDigest would also be very useful for quantile estimation. It is commutative and associative as well.
> Add HyperLogLog / CountMinSketch to parquet statistics
> ------------------------------------------------------
>
> Key: PARQUET-42
> URL: https://issues.apache.org/jira/browse/PARQUET-42
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Alex Levenson
> Priority: Minor
>
> HLL and CMS for rowgroups could help with query planning (getting a sense of data skew) and with cheaply counting approximate distinct values. Both are commutative, which means they can be combined across rowgroups (unlike an exact distinct count, for example).
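
To illustrate the mergeability point in the description, here is a minimal sketch of a toy HyperLogLog in Python. Everything in it is an assumption made for illustration: the class name, the choice of SHA-1 as the hash, and the register count are not taken from parquet-mr or from any proposed Parquet format change. It only shows that per-rowgroup sketches can be combined with an element-wise max, in any order, and that the result matches a sketch built over all the values at once.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash taken from SHA-1; the hash choice is arbitrary here.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Element-wise max: commutative, associative, and idempotent.
        assert self.p == other.p, "sketches must use the same precision"
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def sketch_of(values, p=10):
    """Build a sketch over an iterable, e.g. one rowgroup's column values."""
    hll = HyperLogLog(p)
    for v in values:
        hll.add(v)
    return hll
```

A reader holding one such sketch per rowgroup could merge them to estimate the distinct count for the whole file without rescanning the data, which is not possible with exact per-rowgroup distinct counts because the overlap between rowgroups is unknown.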
--
This message was sent by Atlassian Jira
(v8.3.4#803005)