You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Chunwei Lei (Jira)" <ji...@apache.org> on 2022/06/30 03:10:00 UTC

[jira] [Commented] (HIVE-26221) Add histogram-based column statistics

    [ https://issues.apache.org/jira/browse/HIVE-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560801#comment-17560801 ] 

Chunwei Lei commented on HIVE-26221:
------------------------------------

Looking forward to this feature. BTW, is there any way to support {{string}}?

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>
> Hive does not support histogram statistics, which are particularly useful for skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a hard-coded value of 1/3 (see [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).])
> The current proposal aims at integrating histogram as an additional column statistics, stored into the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. Datasketches are small, stateful programs that process massive data-streams and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric families) and temporal data types (date and timestamp).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)