Posted to dev@parquet.apache.org by "Mukul Sabharwal (Jira)" <ji...@apache.org> on 2020/09/25 17:26:00 UTC

[jira] [Commented] (PARQUET-42) Add HyperLogLog / CountMinSketch to parquet statistics

    [ https://issues.apache.org/jira/browse/PARQUET-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202319#comment-17202319 ] 

Mukul Sabharwal commented on PARQUET-42:
----------------------------------------

It would be nice to standardize it. TDigest would also be very useful for quantile estimation. It is commutative and associative as well.

> Add HyperLogLog / CountMinSketch to parquet statistics
> ------------------------------------------------------
>
>                 Key: PARQUET-42
>                 URL: https://issues.apache.org/jira/browse/PARQUET-42
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Alex Levenson
>            Priority: Minor
>
> HLL and CMS for rowgroups could help with query planning (getting a sense of data skew) and with cheaply counting approximate distinct values. Both are commutative, which means they can be combined across rowgroups (unlike an exact distinct count, for example).
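
To illustrate why mergeability matters here, below is a minimal sketch of a HyperLogLog whose per-rowgroup states are combined by taking the element-wise maximum of their registers. It is not the parquet-mr implementation or any proposed on-disk format: the class name HyperLogLog, the precision p=12, the SHA-1 based hash, and the two simulated rowgroups are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog (hypothetical, for illustration only)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed value)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Element-wise max is commutative and associative, so sketches from
        # different rowgroups can be combined in any order.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two simulated rowgroups with overlapping values (10000 distinct overall).
rg1, rg2, whole = HyperLogLog(), HyperLogLog(), HyperLogLog()
for v in range(0, 6000):
    rg1.add(v)
for v in range(4000, 10000):
    rg2.add(v)
for v in range(0, 10000):
    whole.add(v)

merged = rg1.merge(rg2)
print(round(merged.estimate()))   # approximate distinct count across both rowgroups
```

The merged registers are identical to those of a sketch built over all the values at once, which is what lets a reader combine per-rowgroup sketches without rescanning the data. Exact per-rowgroup distinct counts cannot be combined this way, because the overlap between rowgroups is unknown.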



--
This message was sent by Atlassian Jira
(v8.3.4#803005)