You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Seonggon Namgung (Jira)" <ji...@apache.org> on 2023/04/17 09:12:00 UTC
[jira] [Assigned] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

     [ https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seonggon Namgung reassigned HIVE-27082:
---------------------------------------

    Assignee: Seonggon Namgung  (was: Sungwoo Park)

> AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27082
>                 URL: https://issues.apache.org/jira/browse/HIVE-27082
>             Project: Hive
>          Issue Type: Improvement
>          Components: Standalone Metastore
>    Affects Versions: 3.1.3, 4.0.0-alpha-2
>            Reporter: Sungwoo Park
>            Assignee: Seonggon Namgung
>            Priority: Major
>              Labels: pull-request-available
>
> This JIRA deals with non-determinisitic behavior of Hive in generating DAGs.
> The non-determinstic behavior of Hive in generating DAGs is due to the logic in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), as well as the disproportionate distribution of Nulls in HIVE_DEFAULT_PARTITION.
> Here is what is happening in the case of the TPC-DS dataset. Let us use web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a running example.
> In the course of running TPC-DS queries, Hive asks MetaStore about the column statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.
> --- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.
> --- With HIVE_DEFAULT_PARTITION, however, it report a total of 1800087 nulls, almost twice as many.
> The first call to MetaStore returns the correct result, but all subsequent requests are likely to return the same result from the cache, irrespective of the inclusion of HIVE_DEFAULT_PARTITION. This is because AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same way as other partNames, and the difference in the size of partNames[] is just 1. The outcome depends on the duration of intervening queries, so everything is now non-deterministic.
> If a wrong value of numNulls is returned, Hive generates a different DAG which make takes much longer than the correct one. The problem is particularly pronounced here because of the huge number of nulls in HIVE_DEFAULT_PARTITION. It is ironic to see that the query optimizer is so efficient that a single wrong guess of numNulls creates a very inefficient DAG.
> Note that this behavior cannot be avoided by setting hive.metastore.aggregate.stats.cache.max.variance to zero because the difference in the number of partNames[] between the argument and the entry in the cache is just 1.
> So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION in a special way, by not returning the result in the cache if there is a difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should provide the use with an option to activate this feature).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)