Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/09/20 00:28:00 UTC

[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics

     [ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=810209&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-810209 ]

ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Sep/22 00:27
            Start Date: 20/Sep/22 00:27
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] commented on PR #3137:
URL: https://github.com/apache/hive/pull/3137#issuecomment-1251703711

   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.




Issue Time Tracking
-------------------

            Worklog Id:     (was: 810209)
    Remaining Estimate: 0h
            Time Spent: 10m

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a hard-coded value of 1/3 (see [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims to integrate histograms as an additional column statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. Datasketches are small, stateful programs that process massive data-streams and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric families) and temporal data types (date and timestamp).
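
For illustration only, the idea in the description above (estimating range-predicate selectivity from a CDF, and merging partition-level statistics into a table-level one) could be sketched roughly as follows. This is not the code from the pull request and it does not use the KLL sketch: an exact empirical CDF over a sorted array stands in for the approximate one a KLL sketch would provide. The class name RangeSelectivitySketch and all of its methods are hypothetical; only the 1/3 fallback comes from the issue text.

```java
import java.util.Arrays;

// Hypothetical stand-in for a histogram statistic: an exact empirical CDF.
// A real implementation would hold a bounded-size KLL sketch instead.
public class RangeSelectivitySketch {
    // Fallback used when no distribution information is available;
    // the issue describes this hard-coded value as Hive's current behaviour.
    public static final double DEFAULT_RANGE_SELECTIVITY = 1.0 / 3.0;

    private final double[] sorted;

    public RangeSelectivitySketch(double[] values) {
        this.sorted = values.clone();
        Arrays.sort(this.sorted);
    }

    // Fraction of stored values that are <= x.
    public double cdf(double x) {
        if (sorted.length == 0) {
            return 0.0;
        }
        int lo = 0;
        int hi = sorted.length;
        // Binary search for the first index whose value is > x.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    // Estimated selectivity of the predicate: low < col AND col <= high.
    public double selectivity(double low, double high) {
        if (sorted.length == 0) {
            return DEFAULT_RANGE_SELECTIVITY;
        }
        if (high <= low) {
            return 0.0;
        }
        return cdf(high) - cdf(low);
    }

    // Merge two partition-level statistics into a table-level one.
    // Here this is a plain concatenation; a KLL sketch merges in bounded space.
    public static RangeSelectivitySketch merge(RangeSelectivitySketch a,
                                               RangeSelectivitySketch b) {
        double[] all = new double[a.sorted.length + b.sorted.length];
        System.arraycopy(a.sorted, 0, all, 0, a.sorted.length);
        System.arraycopy(b.sorted, 0, all, a.sorted.length, b.sorted.length);
        return new RangeSelectivitySketch(all);
    }

    public static void main(String[] args) {
        // Skewed column: one partition holds 90 rows equal to 1,
        // the other holds 10 rows spread over 10, 20, ..., 100.
        double[] p1 = new double[90];
        Arrays.fill(p1, 1.0);
        double[] p2 = new double[10];
        for (int i = 0; i < p2.length; i++) {
            p2[i] = 10.0 * (i + 1);
        }
        RangeSelectivitySketch table =
            merge(new RangeSelectivitySketch(p1), new RangeSelectivitySketch(p2));

        System.out.println("0 < col <= 5   : " + table.selectivity(0.0, 5.0));
        System.out.println("5 < col <= 100 : " + table.selectivity(5.0, 100.0));
        System.out.println("fixed fallback : " + DEFAULT_RANGE_SELECTIVITY);
    }
}
```

On this skewed sample the two ranges come out at roughly 0.9 and 0.1, whereas a fixed 1/3 would assign both the same estimate. The exact array used here grows with the data, which is precisely what the memory/accuracy requirement in the description rules out; it is kept only to make the CDF arithmetic easy to follow.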



--
This message was sent by Atlassian Jira
(v8.20.10#820010)