Posted to issues@hive.apache.org by "Johan Gustavsson (JIRA)" <ji...@apache.org> on 2016/01/04 02:57:39 UTC
[jira] [Commented] (HIVE-12664) Bug in reduce deduplication optimization causing ArrayOutOfBoundException

    [ https://issues.apache.org/jira/browse/HIVE-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080617#comment-15080617 ] 

Johan Gustavsson commented on HIVE-12664:
-----------------------------------------

[~ashutoshc], sorry for the late response. The stack trace is below:
{code}
15/12/22 03:09:13 ERROR ql.Driver: FAILED: IndexOutOfBoundsException Index: 1, Size: 1
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
	at java.util.ArrayList.rangeCheck(ArrayList.java:635)
	at java.util.ArrayList.get(ArrayList.java:411)
	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.merge(ReduceSinkDeDuplication.java:212)
	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$JoinReducerProc.process(ReduceSinkDeDuplication.java:547)
	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication$AbsctractReducerReducerProc.process(ReduceSinkDeDuplication.java:164)
	at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:132)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
	at org.apache.hadoop.hive.ql.optimizer.correlation.ReduceSinkDeDuplication.transform(ReduceSinkDeDuplication.java:107)
	at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:146)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9423)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
	at com.treasure_data.hadoop.hive.runner.QueryRunner.processQueryCmd(QueryRunner.java:453)
	at com.treasure_data.hadoop.hive.runner.QueryRunner.processCmd(QueryRunner.java:394)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359)
	at com.treasure_data.hadoop.hive.runner.QueryRunner.run(QueryRunner.java:313)
	at com.treasure_data.hadoop.hive.runner.QueryRunner$1.run(QueryRunner.java:192)
	at com.treasure_data.hadoop.hive.runner.QueryRunner$1.run(QueryRunner.java:190)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
	at com.treasure_data.hadoop.util.TDUtil.doAs(TDUtil.java:226)
	at com.treasure_data.hadoop.hive.runner.QueryRunner.main(QueryRunner.java:190)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
{code}
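
For readers unfamiliar with this failure mode: "Index: 1, Size: 1" raised from ArrayList.get means that something asked for the second element of a list holding only one. The following is a minimal, hypothetical sketch of that pattern and of a bounds-checked alternative. It is not the Hive code and not the attached patch; the class name, the method names, and the use of strings as stand-ins for operator children are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hive.
class ChildIndexSketch {

    // Assumes the list always has at least two entries.
    // With a single-element list this throws IndexOutOfBoundsException.
    static String unsafeSecondChild(List<String> children) {
        return children.get(1);
    }

    // Checks the size first and returns null when there is no second entry.
    static String safeSecondChild(List<String> children) {
        return children.size() > 1 ? children.get(1) : null;
    }

    public static void main(String[] args) {
        // Strings stand in for operator children (an assumption for this sketch).
        List<String> oneChild = new ArrayList<>(Arrays.asList("JOIN"));

        try {
            unsafeSecondChild(oneChild);
        } catch (IndexOutOfBoundsException e) {
            // The exact message text depends on the Java version.
            System.out.println("unsafe access failed: " + e.getMessage());
        }

        // The guarded variant reports the missing child instead of throwing.
        System.out.println("safe access returned: " + safeSecondChild(oneChild));
    }
}
```

Whether the real fix should guard the index, iterate over all children, or skip the optimization for this plan shape is a separate question that this sketch does not answer.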


> Bug in reduce deduplication optimization causing ArrayOutOfBoundException
> -------------------------------------------------------------------------
>
>                 Key: HIVE-12664
>                 URL: https://issues.apache.org/jira/browse/HIVE-12664
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.1.1, 1.2.1
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>         Attachments: HIVE-12664-1.patch, HIVE-12664.1.patch, HIVE-12664.patch
>
>
> The optimisation check for reduce deduplication only inspects the first child node when looking for a join, and the check itself also contains a major bug, so it always fails with an IndexOutOfBoundsException.
> Sample table schema:
> ||time||user||host||path||referer||code||agent||size||method||
> |int|string|string|string|string|bigint|string|bigint|string|
> Sample query:
> {code:sql}
> SELECT 
>   t1.host,
>   COUNT(DISTINCT t1.`date`) AS login_count,
>   MAX(t2.code) AS code,
>   unix_timestamp() AS time
> FROM (
>     SELECT 
>       HOST,
>       MIN(time) AS DATE
>     FROM
>       www_access
>     WHERE
>       HOST IS NOT NULL
>     GROUP BY
>       HOST
>   ) t1
> JOIN (
>     SELECT 
>       HOST,
>       MIN(time) AS code
>     FROM
>       www_access
>     WHERE
>       HOST IS NOT NULL
>     GROUP BY
>       HOST
>   ) t2
>   ON t1.host = t2.host
> GROUP BY
>   t1.host
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)