Posted to dev@hive.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2013/12/12 17:15:07 UTC
[jira] [Commented] (HIVE-6021) Problem in GroupByOperator for handling distinct aggregations

    [ https://issues.apache.org/jira/browse/HIVE-6021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846400#comment-13846400 ] 

Xuefu Zhang commented on HIVE-6021:
-----------------------------------

[~sunrui] Thanks for your contribution. Do you mind providing the following?

1. A test case, similar to the one you constructed, that reproduces the problem.
2. A review board entry.

> Problem in GroupByOperator for handling distinct aggregations
> -------------------------------------------------------------
>
>                 Key: HIVE-6021
>                 URL: https://issues.apache.org/jira/browse/HIVE-6021
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-6021.1.patch
>
>
> Run the following test case on Hive 0.12:
> {code:sql}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> set hive.map.aggr=false; 
> select count(key),count(distinct value) from src group by key;
> {code}
> We will get an ArrayIndexOutOfBoundsException from GroupByOperator:
> {code}
> java.lang.RuntimeException: Error in configuring object
> 	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> 	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> 	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> 	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:485)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> Caused by: java.lang.reflect.InvocationTargetException
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> 	... 5 more
> Caused by: java.lang.RuntimeException: Reduce operator initialization failed
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:159)
> 	... 10 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> 	at org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:281)
> 	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:152)
> 	... 10 more
> {code}
> explain select count(key),count(distinct value) from src group by key;
> {code}
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         src 
>           TableScan
>             alias: src
>             Select Operator
>               expressions:
>                     expr: key
>                     type: int
>                     expr: value
>                     type: string
>               outputColumnNames: key, value
>               Reduce Output Operator
>                 key expressions:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 sort order: ++
>                 Map-reduce partition columns:
>                       expr: key
>                       type: int
>                 tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(KEY._col0)   // The parameter causes this problem
>                            ^^^^^^^^^^^                
>                 expr: count(DISTINCT KEY._col1:0._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: int
>           mode: complete
>           outputColumnNames: _col0, _col1, _col2
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: bigint
>                   expr: _col2
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The root cause is in GroupByOperator.initializeOp(). The method does not handle the following case:
> In a query with distinct aggregations, an aggregation function takes a parameter that is a group-by key column but not a distinct key column.
> {code}
>         if (unionExprEval != null) {
>           String[] names = parameters.get(j).getExprString().split("\\.");
>           // parameters of the form : KEY.colx:t.coly
>           if (Utilities.ReduceField.KEY.name().equals(names[0])) {
>             String name = names[names.length - 2];
>             int tag = Integer.parseInt(name.split("\\:")[1]);
>             
>             ...
>             
>           } else {
>             // will be VALUE._COLx
>             if (!nonDistinctAggrs.contains(i)) {
>               nonDistinctAggrs.add(i);
>             }
>           }
> {code}
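
To illustrate why the snippet above fails for count(KEY._col0), the following standalone Java sketch replays the same string parsing outside Hive. The class DistinctTagSketch and its methods naiveTag and guardedTag are hypothetical names introduced only for illustration. guardedTag shows one possible guard under the assumption that a missing ":t" segment means "not a distinct key column"; it is not necessarily what HIVE-6021.1.patch does.

```java
import java.util.OptionalInt;

// Standalone sketch; it does not use any Hive classes.
final class DistinctTagSketch {

    // Mirrors the parsing in the quoted snippet: it assumes every KEY parameter
    // has the form KEY.colx:t.coly and reads the tag t from the
    // second-to-last dot-separated segment.
    static int naiveTag(String exprString) {
        String[] names = exprString.split("\\.");
        String name = names[names.length - 2];
        // For "KEY._col0" the segment is "KEY", which has no ':' part,
        // so index 1 is out of bounds.
        return Integer.parseInt(name.split("\\:")[1]);
    }

    // Hypothetical guard (an assumption, not the actual patch): return a tag
    // only when the expression really has the KEY.colx:t.coly shape.
    static OptionalInt guardedTag(String exprString) {
        String[] names = exprString.split("\\.");
        if (names.length < 3 || !"KEY".equals(names[0])) {
            return OptionalInt.empty();
        }
        String[] parts = names[names.length - 2].split(":");
        if (parts.length != 2) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(parts[1]));
    }

    public static void main(String[] args) {
        // Distinct parameter from the plan: the tag is parsed as 0.
        System.out.println(naiveTag("KEY._col1:0._col0"));

        // Group-by key used as a plain aggregation parameter: no tag found.
        System.out.println(guardedTag("KEY._col0").isPresent());

        // The unguarded parsing throws for the same input.
        try {
            naiveTag("KEY._col0");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naiveTag failed: " + e);
        }
    }
}
```

With a check of this kind, a parameter such as KEY._col0 could be routed to the same handling as the non-distinct VALUE._colx parameters instead of being parsed for a tag; whether that is the right handling inside GroupByOperator is for the patch and its review to decide.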


