Posted to issues@hive.apache.org by "Alessandro Solimando (Jira)" <ji...@apache.org> on 2022/05/11 09:23:00 UTC

[jira] [Updated] (HIVE-26221) Add histogram-based column statistics

     [ https://issues.apache.org/jira/browse/HIVE-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Solimando updated HIVE-26221:
----------------------------------------
    Component/s: CBO
                 Metastore

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>
> Hive does not support histogram statistics, which are particularly useful for range predicates and for skewed data (very common in practice).
> Hive's current selectivity estimation for range predicates is based on a hard-coded value of 1/3 (see [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/4622860b8c7dbddaf4c556e65c5039c60da15e82/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates a UDAF for [KLL data sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html]. Data sketches are small, stateful programs that process massive data streams and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders of magnitude faster than traditional, exact methods.
> We propose to use KLL, and more specifically its cumulative distribution function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal only targets numeric data types (float, integer and numeric families), excluding string and temporal data types for the moment.
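
To illustrate the idea in the description above, here is a minimal Python sketch of CDF-based selectivity estimation with mergeable partition-level summaries. It is not the proposed implementation and does not use the DataSketches library: `ToySketch` is a hypothetical stand-in that keeps every value sorted, so its CDF is exact, whereas a real KLL sketch would keep a bounded, compacted sample with approximation guarantees. The class name, method names, and sample data are all assumptions made for this example.

```python
from bisect import bisect_right

# The hard-coded range-predicate selectivity mentioned in the issue.
DEFAULT_RANGE_SELECTIVITY = 1 / 3


class ToySketch:
    """Exact, mergeable quantile summary (illustration only, NOT a KLL sketch)."""

    def __init__(self, values=()):
        # A real KLL sketch would retain only a small compacted sample.
        self._sorted = sorted(values)

    def merge(self, other):
        """Combine two partition-level summaries into a table-level one."""
        return ToySketch(self._sorted + other._sorted)

    def cdf(self, x):
        """Fraction of summarized values that are <= x."""
        if not self._sorted:
            return 0.0
        return bisect_right(self._sorted, x) / len(self._sorted)

    def range_selectivity(self, low, high):
        """Estimated fraction of rows with low < value <= high."""
        return max(0.0, self.cdf(high) - self.cdf(low))


# Two partitions of a skewed column: most values are small.
part_a = ToySketch([1] * 90 + [100] * 10)
part_b = ToySketch([2] * 80 + [500] * 20)
table = part_a.merge(part_b)

# Predicate "col <= 10": 170 of the 200 rows qualify.
sel_le_10 = table.cdf(10)

# Predicate "10 < col <= 1000": 30 of the 200 rows qualify.
sel_tail = table.range_selectivity(10, 1000)

print(f"col <= 10:        {sel_le_10:.2f} (fixed estimate: {DEFAULT_RANGE_SELECTIVITY:.2f})")
print(f"10 < col <= 1000: {sel_tail:.2f} (fixed estimate: {DEFAULT_RANGE_SELECTIVITY:.2f})")
```

On this skewed sample the fixed 1/3 estimate is far from both true selectivities, while the CDF difference recovers them. The `merge` method mirrors the merge-ability requirement: partition-level summaries are combined without rescanning the data.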



--
This message was sent by Atlassian Jira
(v8.20.7#820007)