You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2014/10/19 20:26:33 UTC

[jira] [Updated] (HIVE-8517) When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression

     [ https://issues.apache.org/jira/browse/HIVE-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated HIVE-8517:
----------------------------------
    Assignee: Mostafa Mokhtar

> When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8517
>                 URL: https://issues.apache.org/jira/browse/HIVE-8517
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 0.14.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Mostafa Mokhtar
>             Fix For: 0.14.0
>
>
> When joining on partition column number of partitions is used as NDV which gets overridden by StatsUtils.getColStatisticsFromExpression and the number of partitions used as NDV is replaced by number of rows which results in the same behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196. "Joining on partition columns with fetch column stats enabled results it very small CE which negatively affects query performance "
> This is the call stack.
> {code}
> StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) line: 1001	
> StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 1479	
> DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90	
> PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94	
> PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78	
> PreOrderWalker.walk(Node) line: 54	
> PreOrderWalker.walk(Node) line: 59	
> PreOrderWalker.walk(Node) line: 59	
> PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109	
> AnnotateWithStatistics.transform(ParseContext) line: 78	
> TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248	
> TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, Set<WriteEntity>) line: 120	
> TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99	
> SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037	
> SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
> ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74	
> ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
> Driver.compile(String, boolean) line: 415	
> {code}
> Query
> {code}
> select
>   ss_item_sk item_sk, d_date, sum(ss_sales_price),
>   sum(sum(ss_sales_price))
>       over (partition by ss_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
> from store_sales
>     ,date_dim
> where ss_sold_date_sk=d_date_sk
>   and d_month_seq between 1193 and 1193+11
>   and ss_item_sk is not NULL
> group by ss_item_sk, d_date
> {code}
> Plan 
> Notice in the Map join operator the number of rows drop from 82,510,879,939 to 36524 after the join.
> {code}
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Map 1 <- Map 4 (BROADCAST_EDGE)
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>       DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: ss_item_sk is not null (type: boolean)
>                   Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ss_item_sk is not null (type: boolean)
>                     Statistics: Num rows: 82510879939 Data size: 652315818272 Basic stats: COMPLETE Column stats: COMPLETE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
>                         1 {d_date_sk} {d_date} {d_month_seq}
>                       keys:
>                         0 ss_sold_date_sk (type: int)
>                         1 d_date_sk (type: int)
>                       outputColumnNames: _col1, _col12, _col22, _col26, _col28, _col29
>                       input vertices:
>                         1 Map 4
>                       Statistics: Num rows: 36524 Data size: 4163736 Basic stats: COMPLETE Column stats: COMPLETE
>                       Filter Operator
>                         predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 AND 1204) and _col1 is not null) (type: boolean)
>                         Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
>                         Select Operator
>                           expressions: _col1 (type: int), _col28 (type: string), _col12 (type: float)
>                           outputColumnNames: _col1, _col28, _col12
>                           Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
>                           Group By Operator
>                             aggregations: sum(_col12)
>                             keys: _col1 (type: int), _col28 (type: string)
>                             mode: hash
>                             outputColumnNames: _col0, _col1, _col2
>                             Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
>                             Reduce Output Operator
>                               key expressions: _col0 (type: int), _col1 (type: string)
>                               sort order: ++
>                               Map-reduce partition columns: _col0 (type: int), _col1 (type: string)
>                               Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
>                               value expressions: _col2 (type: double)
>             Execution mode: vectorized
>         Map 4
>             Map Operator Tree:
>                 TableScan
>                   alias: date_dim
>                   filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
>                   Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
>                     Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       key expressions: d_date_sk (type: int)
>                       sort order: +
>                       Map-reduce partition columns: d_date_sk (type: int)
>                       Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: d_date (type: string), d_month_seq (type: int)
>                     Select Operator
>                       expressions: d_date_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
>                       Group By Operator
>                         keys: _col0 (type: int)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
>                         Dynamic Partitioning Event Operator
>                           Target Input: store_sales
>                           Partition key expr: ss_sold_date_sk
>                           Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
>                           Target column: ss_sold_date_sk
>                           Target Vertex: Map 1
>             Execution mode: vectorized
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: sum(VALUE._col0)
>                 keys: KEY._col0 (type: int), KEY._col1 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int), _col1 (type: string)
>                   sort order: ++
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                   value expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double)
>             Execution mode: vectorized
>         Reducer 3
>             Reduce Operator Tree:
>               Extract
>                 Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                 PTF Operator
>                   Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double), _wcol0 (type: double)
>                     outputColumnNames: _col0, _col1, _col2, _col3
>                     Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
>                       table:
>                           input format: org.apache.hadoop.mapred.TextInputFormat
>                           output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)