Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2020/07/01 07:23:00 UTC
[jira] [Created] (HIVE-23788) FilterStatsRule misestimate causes
hashtable computation to rehash often
Rajesh Balamohan created HIVE-23788:
---------------------------------------
Summary: FilterStatsRule misestimate causes hashtable computation to rehash often
Key: HIVE-23788
URL: https://issues.apache.org/jira/browse/HIVE-23788
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan
Depending on the available statistics, FilterStatsRule sometimes estimates the output row count as numRows/3. This projects a lower keyCount for the hashtable computation, which in turn causes frequent rehashing.
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L952]
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1192]
E.g. TPCDS Q74 @ 10TB: as part of evaluating the "t_s_firstyear.year_total > 0, t_w_secyear.year_total / t_w_firstyear.year_total, t_s_secyear.year_total / t_s_firstyear.year_total" conditions, it projects 1/3rd of the rows, causing the hashtable in the downstream vertex to be rehashed.
May have to check whether stats can be projected for these columns correctly.
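To illustrate the effect described above, here is a minimal sketch of how an underestimated key count leads to extra rehashes. It is not Hive's hashtable code: the class RehashSketch, the 0.75 load factor, the power-of-two sizing, and the doubling growth policy are all assumptions made for illustration.

```java
// Hypothetical model of a hashtable sized from a row-count estimate.
// NOT Hive's implementation: load factor, power-of-two sizing and
// doubling on growth are assumptions chosen for illustration only.
final class RehashSketch {
    static final double LOAD_FACTOR = 0.75; // assumed

    // Smallest power-of-two capacity that holds expectedKeys
    // without exceeding the load factor.
    static long initialCapacity(long expectedKeys) {
        long needed = (long) Math.ceil(expectedKeys / LOAD_FACTOR);
        long capacity = 1;
        while (capacity < needed) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Number of times the table must double (rehash) when it was sized
    // for estimatedKeys but actualKeys are inserted.
    static int countRehashes(long estimatedKeys, long actualKeys) {
        long capacity = initialCapacity(estimatedKeys);
        int rehashes = 0;
        while (actualKeys > capacity * LOAD_FACTOR) {
            capacity <<= 1;
            rehashes++;
        }
        return rehashes;
    }

    public static void main(String[] args) {
        long actualKeys = 3_000_000L; // made-up figure, not from Q74
        // Sized from an accurate estimate.
        System.out.println(countRehashes(actualKeys, actualKeys));
        // Sized from the numRows/3 estimate.
        System.out.println(countRehashes(actualKeys / 3, actualKeys));
    }
}
```

Under these assumptions, a table sized from the accurate count never grows, while one sized from numRows/3 has to grow at least once before all keys fit. Each growth step re-inserts every key already in the table, so the cost scales with the table size.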