Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2020/07/01 07:23:00 UTC

[jira] [Created] (HIVE-23788) FilterStatsRule misestimate causes hashtable computation to rehash often

Rajesh Balamohan created HIVE-23788:
---------------------------------------

             Summary: FilterStatsRule misestimate causes hashtable computation to rehash often
                 Key: HIVE-23788
                 URL: https://issues.apache.org/jira/browse/HIVE-23788
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


Depending on the available statistics, FilterStatsRule sometimes estimates the output row count as numRows/3. This causes a lower keyCount to be projected for the hashtable computation, which leads to frequent rehashing.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L952]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1192]
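
To illustrate why an underestimated keyCount matters, here is a minimal sketch of a generic power-of-two hashtable that is pre-sized from an estimate and has to double whenever the real key count exceeds its threshold. This is not Hive's hashtable code; the class RehashSketch, the methods capacityFor and rehashCount, and the 0.75 load factor are all assumptions made for illustration only.

```java
class RehashSketch {
    // Hypothetical sizing rule: smallest power-of-two capacity that can hold
    // `keys` entries without exceeding the given load factor.
    static long capacityFor(long keys, double loadFactor) {
        long needed = (long) Math.ceil(keys / loadFactor);
        long cap = 1;
        while (cap < needed) {
            cap <<= 1;
        }
        return cap;
    }

    // Counts how many times a table sized for `estimatedKeys` has to double
    // (each doubling being a full rehash) before it can hold `actualKeys`.
    static int rehashCount(long estimatedKeys, long actualKeys, double loadFactor) {
        long cap = capacityFor(estimatedKeys, loadFactor);
        int rehashes = 0;
        while (cap * loadFactor < actualKeys) {
            cap <<= 1;
            rehashes++;
        }
        return rehashes;
    }

    public static void main(String[] args) {
        long actual = 3_000_000L;      // illustrative key count, not taken from Q74
        long estimated = actual / 3;   // mimics a numRows/3 style estimate
        System.out.println("rehashes with numRows/3 estimate: "
                + rehashCount(estimated, actual, 0.75));
        System.out.println("rehashes with accurate estimate:  "
                + rehashCount(actual, actual, 0.75));
    }
}
```

In this simplified model a table sized from an accurate estimate never rehashes, while one sized from a third of the real key count has to grow at least once, and every growth step re-inserts all keys seen so far.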

E.g., TPC-DS Q74 @ 10TB: while evaluating the conditions "t_s_firstyear.year_total > 0, t_w_secyear.year_total / t_w_firstyear.year_total , t_s_secyear.year_total / t_s_firstyear.year_total ", it projects 1/3rd of the rows, which causes the hashtable in the downstream vertex to be rehashed.

We may have to check whether stats can be projected correctly for these columns.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)