You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Vineet Garg (Jira)" <ji...@apache.org> on 2020/11/09 20:35:00 UTC
[jira] [Commented] (HIVE-24296) NDV adjusted twice causing reducer task underestimation

    [ https://issues.apache.org/jira/browse/HIVE-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228804#comment-17228804 ] 

Vineet Garg commented on HIVE-24296:
------------------------------------

bq. It would be good to remove static readjustment in StatsRuleProcFactory.
This may have adverse effect on other queries. The column statistics are being scaled according to estimated number of rows coming from JOIN. I guess we may try to remove it and experiment with tpcds/tpch queries to make sure that there is no regression.

[~rajesh.balamohan] [~jcamachorodriguez] What do you suggest? If there is a way to run tpch/tpcds queries to confirm I can remove the adjustment code and provide a debug jar.

> NDV adjusted twice causing reducer task underestimation
> -------------------------------------------------------
>
>                 Key: HIVE-24296
>                 URL: https://issues.apache.org/jira/browse/HIVE-24296
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2550]
>  
> {{StatsRuleProcFactory::updateColStats}}::
> {code:java}
>    	if (ratio <= 1.0) {
>           newDV = (long) Math.ceil(ratio * oldDV);
>         }
>         cs.setCountDistint(newDV);
> {code}
> Though RelHive* has the latest statistics, it is adjusted again {{StatsRuleProcFactory::updateColStats}} and it is done at linear scale.
>  
> Because of this, downstream vertex gets lesser number of tasks causing latency issues.
> E.g Q10 + TPCDS @10 TB scale. Attaching a snippet of "explain analyze" which shows stats underestimation.
> "Reducer 13" is underestimated 10x, when compared to runtime details. Projected NDV from RelHive* was around 65989699.
> However, due to the ratio calculation in StatsRuleProcFactory, it gets readjusted to ((948122598/14291978461) * 65989699)) ~= 4377723.
> It would be good to remove static readjustment in StatsRuleProcFactory.
> {noformat}
> Edges:
>         Map 10 <- Map 9 (BROADCAST_EDGE)
>         Map 12 <- Map 9 (BROADCAST_EDGE)
>         Map 2 <- Map 7 (BROADCAST_EDGE)
>         Map 8 <- Map 9 (BROADCAST_EDGE), Reducer 6 (BROADCAST_EDGE)
>         Reducer 11 <- Map 10 (SIMPLE_EDGE)
>         Reducer 13 <- Map 12 (SIMPLE_EDGE)
>         Reducer 3 <- Map 1 (BROADCAST_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE), Map 8 (CUSTOM_SIMPLE_EDGE), Reducer 11 (BROADCAST_EDGE), Reducer 13 (BROADCAST_EDGE)
>         Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
>         Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
>         Reducer 6 <- Map 2 (CUSTOM_SIMPLE_EDGE)
> Map 12
>             Map Operator Tree:
>                 TableScan
>                   alias: catalog_sales
>                   filterExpr: cs_ship_customer_sk is not null (type: boolean)
>                   Statistics: Num rows: 14327953968/552509183 Data size: 228959459440 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: cs_ship_customer_sk is not null (type: boolean)
>                     Statistics: Num rows: 14291978461/551122492 Data size: 228384573968 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: cs_ship_customer_sk (type: bigint), cs_sold_date_sk (type: bigint)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 14291978461/551122492 Data size: 228384573968 Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col1 (type: bigint)
>                           1 _col0 (type: bigint)
>                         outputColumnNames: _col0
>                         input vertices:
>                           1 Map 9
>                         Statistics: Num rows: 948122598/551122492 Data size: 7297899376 Basic stats: COMPLETE Column stats: COMPLETE
>                         Group By Operator
>                           keys: _col0 (type: bigint)
>                           minReductionHashAggr: 0.99
>                           mode: hash
>                           outputColumnNames: _col0
>                           Statistics: Num rows: 126954025/61576194 Data size: 977191880 Basic stats: COMPLETE Column stats: COMPLETE
>                           Reduce Output Operator
>                             key expressions: _col0 (type: bigint)
>                             null sort order: a
>                             sort order: +
>                             Map-reduce partition columns: _col0 (type: bigint)
>                             Statistics: Num rows: 126954025/61576194 Data size: 977191880 Basic stats: COMPLETE Column stats: COMPLETE
> ...
> ...
> Reducer 13
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: bigint)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 4377725/40166690 Data size: 33696280 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: true (type: boolean), _col0 (type: bigint)
>                   outputColumnNames: _col0, _col1
>                   Statistics: Num rows: 4377725/40166690 Data size: 51207180 Basic stats: COMPLETE Column stats: COMPLETE
>                   Reduce Output Operator
>                     key expressions: _col1 (type: bigint)
>                     null sort order: a
>                     sort order: +
>                     Map-reduce partition columns: _col1 (type: bigint)
>                     Statistics: Num rows: 4377725/40166690 Data size: 51207180 Basic stats: COMPLETE Column stats: COMPLETE
>                     value expressions: _col0 (type: boolean)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)