Posted to dev@hive.apache.org by "Navis (JIRA)" <ji...@apache.org> on 2014/05/12 09:43:15 UTC

[jira] [Comment Edited] (HIVE-7012) Wrong RS de-duplication in the ReduceSinkDeDuplication Optimizer

    [ https://issues.apache.org/jira/browse/HIVE-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994895#comment-13994895 ] 

Navis edited comment on HIVE-7012 at 5/12/14 7:42 AM:
------------------------------------------------------

[~ashutoshc] Yes, it's intended. In the query from ppd2.q:
{code}
select a.*
  from (
    select key, count(value) as cc
    from srcpart a
    where a.ds = '2008-04-08' and a.hr = '11'
    group by key
  )a
  distribute by a.key
  sort by a.key,a.cc desc
{code}
cc is a field generated by the GBY operator, so it's semantically wrong to merge the RS for the GBY with any following RS. At the same time, sorting on "a.cc" is meaningless here, so it could be removed during optimization, but not in this optimizer (maybe in SemanticAnalyzer?).

[~sunrui] Yes, the RS for distinct should be excluded from any dedup process. Could you take this issue? I think you know it better than me.
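
The problem with backtracking through the GBY can be illustrated with a minimal standalone model (plain maps standing in for Hive's colExprMap; these are hypothetical stand-in types, not Hive's actual classes). The "lenient" method mirrors the questioned branch in ExprNodeDescUtils.backtrack() that keeps going when an operator has no mapping for the column; the "strict" variant stops instead:
{code:java}
import java.util.*;

// Toy model: each operator in the chain from cRS (RS_6) back toward
// pRS (RS_3) carries a column mapping (output column -> parent column),
// like Hive's colExprMap. GBY_4 has no mapping because
// count(DISTINCT ...) produces a brand-new column.
public class BacktrackDemo {
    static final List<Map<String, String>> CHAIN = Arrays.asList(
        Collections.singletonMap("_col0", "_col0"), // SEL_5
        null                                        // GBY_4: colExprMap is null
    );

    // Mirrors the questioned logic: when an operator has no mapping for
    // the column, keep backtracking with the same column name.
    static String lenient(String col) {
        for (Map<String, String> mapping : CHAIN) {
            if (mapping == null || !mapping.containsKey(col)) {
                continue; // skips past GBY_4 and reaches RS_3:_col0
            }
            col = mapping.get(col);
        }
        return col;
    }

    // Stricter variant: a generated column with no mapping should stop
    // the backtrack, so sameKeys() cannot claim the keys match.
    static String strict(String col) {
        for (Map<String, String> mapping : CHAIN) {
            if (mapping == null || !mapping.containsKey(col)) {
                return null; // no provenance past the GroupBy
            }
            col = mapping.get(col);
        }
        return col;
    }

    public static void main(String[] args) {
        System.out.println("lenient: " + lenient("_col0")); // _col0 (wrongly matches pRS's key)
        System.out.println("strict:  " + strict("_col0"));  // null (merge refused)
    }
}
{code}
With the lenient rule, the cRS sort key appears to backtrack to the pRS key and the two RSs look mergeable; with the strict rule the backtrack fails at the GroupBy and the dedup is (correctly) skipped.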



> Wrong RS de-duplication in the ReduceSinkDeDuplication Optimizer
> ----------------------------------------------------------------
>
>                 Key: HIVE-7012
>                 URL: https://issues.apache.org/jira/browse/HIVE-7012
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Sun Rui
>            Assignee: Navis
>         Attachments: HIVE-7012.1.patch.txt, HIVE-7012.2.patch.txt
>
>
> With HIVE 0.13.0, run the following test case:
> {code:sql}
> create table src(key bigint, value string);
> select  
>    count(distinct key) as col0
> from src
> order by col0;
> {code}
> The following exception will be thrown:
> {noformat}
> java.lang.RuntimeException: Error in configuring object
> 	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> 	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> 	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> 	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:485)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.reflect.InvocationTargetException
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> 	... 9 more
> Caused by: java.lang.RuntimeException: Reduce operator initialization failed
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:173)
> 	... 14 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from [0:reducesinkkey0]
> 	at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
> 	at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:150)
> 	at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:79)
> 	at org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:288)
> 	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:376)
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:166)
> 	... 14 more
> {noformat}
> This issue is related to HIVE-6455. When hive.optimize.reducededuplication is set to false, the issue goes away.
> Logical plan when hive.optimize.reducededuplication=false:
> {noformat}
> src 
>   TableScan (TS_0)
>     alias: src
>     Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>     Select Operator (SEL_1)
>       expressions: key (type: bigint)
>       outputColumnNames: key
>       Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>       Group By Operator (GBY_2)
>         aggregations: count(DISTINCT key)
>         keys: key (type: bigint)
>         mode: hash
>         outputColumnNames: _col0, _col1
>         Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>         Reduce Output Operator (RS_3)
>           DistinctColumnIndices:
>           key expressions: _col0 (type: bigint)
>           DistributionKeys: 0
>           sort order: +
>           OutputKeyColumnNames: _col0
>           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>           Group By Operator (GBY_4)
>             aggregations: count(DISTINCT KEY._col0:0._col0)
>             mode: mergepartial
>             outputColumnNames: _col0
>             Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>             Select Operator (SEL_5)
>               expressions: _col0 (type: bigint)
>               outputColumnNames: _col0
>               Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>               Reduce Output Operator (RS_6)
>                 key expressions: _col0 (type: bigint)
>                 DistributionKeys: 1
>                 sort order: +
>                 OutputKeyColumnNames: reducesinkkey0
>                 OutputValueColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                 value expressions: _col0 (type: bigint)
>                 Extract (EX_7)
>                   Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator (FS_8)
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> {noformat}
> You will see that RS_3 and RS_6 are not merged.
> Logical plan when hive.optimize.reducededuplication=true:
> {noformat}
> src 
>   TableScan (TS_0)
>     alias: src
>     Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>     Select Operator (SEL_1)
>       expressions: key (type: bigint)
>       outputColumnNames: key
>       Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>       Group By Operator (GBY_2)
>         aggregations: count(DISTINCT key)
>         keys: key (type: bigint)
>         mode: hash
>         outputColumnNames: _col0, _col1
>         Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>         Reduce Output Operator (RS_3)
>           DistinctColumnIndices:
>           key expressions: _col0 (type: bigint)
>           DistributionKeys: 1
>           sort order: +
>           OutputKeyColumnNames: reducesinkkey0
>           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>           Group By Operator (GBY_4)
>             aggregations: count(DISTINCT KEY._col0:0._col0)
>             mode: mergepartial
>             outputColumnNames: _col0
>             Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>             Select Operator (SEL_5)
>               expressions: _col0 (type: bigint)
>               outputColumnNames: _col0
>               Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>               File Output Operator (FS_8)
>                 compressed: false
>                 Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> {noformat}
> You will see that RS_6 has been merged into RS_3. However, the merge is obviously incorrect because RS_3 and RS_6 have different sort keys. (The sort key for RS_3 is
> key, and the sort key for RS_6 is count(distinct key).)
> The problem is that the method sameKeys() reports that both RSs have the same keys. sameKeys() relies on ExprNodeDescUtils.backtrack() to backtrack a key expression of cRS to pRS.
> I don't understand the reasoning behind the following logic in ExprNodeDescUtils:
>   Why keep backtracking when the current operator has no mapping for the column?
> {code}
>   private static ExprNodeDesc backtrack(ExprNodeColumnDesc column, Operator<?> current,
>       Operator<?> terminal) throws SemanticException {
>     ...
>     if (mapping == null || !mapping.containsKey(column.getColumn())) {
>       return backtrack((ExprNodeDesc)column, current, terminal);
>     }
>     ...
>   }
> {code}
> The process of backtracking _col0 of cRS to pRS:
> RS_6:_col0 --> SEL_5:_col0 --> GBY_4:_col0 (because the colExprMap is null for GBY_4) --> RS_3:_col0 (no mapping for output column _col0), which is a wrong backtrack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)