You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2020/06/12 12:28:00 UTC

[jira] [Assigned] (HIVE-23684) Large underestimation in NDV stats when input and join cardinality ratio is big

     [ https://issues.apache.org/jira/browse/HIVE-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stamatis Zampetakis reassigned HIVE-23684:
------------------------------------------


> Large underestimation in NDV stats when input and join cardinality ratio is big
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-23684
>                 URL: https://issues.apache.org/jira/browse/HIVE-23684
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> Large underestimations of NDV values may occur after a join operation since the current logic will decrease the original NDV values proportionally.
> The [code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558] compares the number of rows of each relation before the join with the number of rows after the join and extracts a ratio for each side. Based on this ratio it adapts (reduces) the NDV accordingly.
> Consider for instance the following query:
> {code:sql}
> select inv_warehouse_sk
>      , inv_item_sk
>      , stddev_samp(inv_quantity_on_hand) stdev
>      , avg(inv_quantity_on_hand)         mean
> from inventory
>    , date_dim
> where inv_date_sk = d_date_sk
>   and d_year = 1999
>   and d_moy = 2
> group by inv_warehouse_sk, inv_item_sk;
> {code}
> For the sake of the discussion, I outline below some relevant stats (from TPCDS30tb):
>  T(inventory) = 1627857000
>  T(date_dim) = 73049
>  T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
>  V(inventory, inv_date_sk) = 261
>  V(inventory, inv_item_sk) = 420000
>  V(inventory, inv_warehouse_sk) = 27
>  V(date_dim, inv, d_date_sk) = 73049
> For instance, in this query the join between inventory and date_dim has ~24M rows while inventory has ~1.5B so the NDV of the columns coming from inventory are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K while the real one is 231000.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)