Posted to issues@hive.apache.org by "Alessandro Solimando (Jira)" <ji...@apache.org> on 2022/06/30 09:13:00 UTC
[jira] [Comment Edited] (HIVE-26221) Add histogram-based column statistics

    [ https://issues.apache.org/jira/browse/HIVE-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560941#comment-17560941 ] 

Alessandro Solimando edited comment on HIVE-26221 at 6/30/22 9:12 AM:
----------------------------------------------------------------------

Thanks [~Chunwei Lei] for your interest. There is already a WIP PR that is almost ready for review (I still need to fix a conflict and update some test output files). I have linked it to the ticket in case you want to take a look before it is finalized.

Regarding support for strings, these are the considerations we have made so far:
 * KLL sketches support only _float_ values: we could of course use an encoding that preserves the lexicographical ordering of strings,
 * there is no general way to use KLL sketches for equality predicates: they are naturally suited to range predicates. For equality we need a notion of "immediate predecessor/successor" to get the cardinality of the range _<pred(elem), elem>_ or _<elem, succ(elem)>_, and this is trivial only for data types that map to the integer family (due to how the [getCDF(float[] splitPoints)|https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/kll/KllFloatsSketch.html#getCDF-float:A-] method works),
 * strings seem to be involved more frequently in equality predicates, for which [ItemsSketch|https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html] is more suitable; we are exploring this angle in a parallel ongoing project.
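
To illustrate the predecessor argument, here is a minimal sketch in Python. It does not use the DataSketches library: an exact empirical CDF over a sorted list stands in for the sketch's approximate CDF, and the names {{cdf}} and {{equality_selectivity}} are hypothetical. It assumes an inclusive ("<=") CDF, which may differ from the semantics of the real getCDF method.

```python
from bisect import bisect_right


def cdf(sorted_values, x):
    """Fraction of values <= x (exact stand-in for a sketch's approximate CDF)."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def equality_selectivity(sorted_values, v):
    """Estimate P(col = v) as CDF(v) - CDF(pred(v)).

    For integers the immediate predecessor is simply v - 1, so the range
    <pred(v), v> contains exactly the rows equal to v.
    """
    return cdf(sorted_values, v) - cdf(sorted_values, v - 1)


values = sorted([1, 2, 2, 2, 3, 5, 5, 8])
sel = equality_selectivity(values, 2)  # 3 of the 8 rows equal 2
print(sel)
```

For strings under lexicographic ordering there is in general no such immediate predecessor to plug into the second CDF call, which is why the same trick does not carry over directly.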



> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>
> Hive does not support histogram statistics, which are particularly useful for skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a hard-coded value of 1/3 (see [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates the [KLL data sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAFs. Data sketches are small, stateful programs that process massive data streams and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders of magnitude faster than traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric families) and temporal data types (date and timestamp).
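
As a rough illustration of the description above, the following Python sketch shows how a CDF yields a range-predicate selectivity, and how partition-level summaries can be merged into a table-level one. It is not the proposed implementation: sorted lists of raw values stand in for KLL sketches (a real sketch keeps only a bounded, approximate summary), and all function names are hypothetical.

```python
from bisect import bisect_left
from heapq import merge

# The hard-coded fallback for range predicates mentioned in the description.
DEFAULT_RANGE_SELECTIVITY = 1 / 3


def cdf_lt(sorted_values, x):
    """Fraction of values strictly below x (exact stand-in for a sketch CDF)."""
    return bisect_left(sorted_values, x) / len(sorted_values)


def range_selectivity(sorted_values, low, high):
    """Estimate the selectivity of the predicate low <= col < high."""
    return cdf_lt(sorted_values, high) - cdf_lt(sorted_values, low)


def merge_partitions(*partitions):
    """Combine sorted partition-level summaries into one table-level summary."""
    return list(merge(*partitions))


# Skewed column: 90 rows hold the value 1, only 10 rows fall in [100, 110).
partition_a = sorted([1] * 60)
partition_b = sorted([1] * 30 + list(range(100, 110)))
table = merge_partitions(partition_a, partition_b)

sel = range_selectivity(table, 100, 200)
print(f"histogram-based: {sel:.3f}, default: {DEFAULT_RANGE_SELECTIVITY:.3f}")
```

On this skewed toy column the CDF-based estimate is roughly 0.1, well below the 1/3 default.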



--
This message was sent by Atlassian Jira
(v8.20.10#820010)