You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Zac Zhou (JIRA)" <ji...@apache.org> on 2017/12/13 11:41:00 UTC
[jira] [Assigned] (HIVE-18270) count(distinct) using join and group
by produce incorrect output when hive.auto.convert.join=false and
hive.auto.convert.join.noconditionaltask=false
[ https://issues.apache.org/jira/browse/HIVE-18270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zac Zhou reassigned HIVE-18270:
-------------------------------
Assignee: Zac Zhou
Affects Version/s: 1.2.1
2.1.1
2.2.0
2.3.0
Description:
When I running the following query:
explain
SELECT foo.id, count(distinct foo.line_id) as factor from
foo JOIN bar ON (foo.id = bar.id)
WHERE foo.orders != 'blah'
group by foo.id;
The following error is got:
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.merge(ReduceSinkDeDuplication.java:216)
at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$JoinReducerProc.process(ReduceSinkDeDuplication.java:557)
at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.process(ReduceSinkDeDuplication.java:166)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication.transform(ReduceSinkDeDuplication.java:108)
at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:192)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10201)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:74)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
It looks like it is a bug of ReduceSinkDeDuplication optimizer.
Since the columns of count distinct need to be added into reduce key for sorting, the reducesink of group can't be replaced with the ones of join.
In the case of count distinct query, reducesink of group should not be merged
Summary: count(distinct) using join and group by produce incorrect output when hive.auto.convert.join=false and hive.auto.convert.join.noconditionaltask=false (was: count(distinct) using join and group by produce incorrect output when hive.auto.convert.join=false and hive.optimize.reducededuplication=true)
> count(distinct) using join and group by produce incorrect output when hive.auto.convert.join=false and hive.auto.convert.join.noconditionaltask=false
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-18270
> URL: https://issues.apache.org/jira/browse/HIVE-18270
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.2.1, 2.1.1, 2.2.0, 2.3.0
> Reporter: Zac Zhou
> Assignee: Zac Zhou
>
> When I running the following query:
> explain
> SELECT foo.id, count(distinct foo.line_id) as factor from
> foo JOIN bar ON (foo.id = bar.id)
> WHERE foo.orders != 'blah'
> group by foo.id;
> The following error is got:
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> at java.util.ArrayList.get(ArrayList.java:411)
> at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.merge(ReduceSinkDeDuplication.java:216)
> at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$JoinReducerProc.process(ReduceSinkDeDuplication.java:557)
> at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.process(ReduceSinkDeDuplication.java:166)
> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
> at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication.transform(ReduceSinkDeDuplication.java:108)
> at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:192)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10201)
> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
> at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:74)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> It looks like it is a bug of ReduceSinkDeDuplication optimizer.
> Since the columns of count distinct need to be added into reduce key for sorting, the reducesink of group can't be replaced with the ones of join.
> In the case of count distinct query, reducesink of group should not be merged
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)