Posted to dev@hive.apache.org by "Zhichun Wu (JIRA)" <ji...@apache.org> on 2014/10/01 02:44:34 UTC

[jira] [Commented] (HIVE-8151) Dynamic partition sort optimization inserts record wrongly to partition when used with GroupBy

    [ https://issues.apache.org/jira/browse/HIVE-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154115#comment-14154115 ] 

Zhichun Wu commented on HIVE-8151:
----------------------------------

[~prasanth_j], I find that the EXPLAIN output of the insert statement in the test case differs slightly depending on whether this optimization is enabled. After digging into the code, it seems that before the NonBlockingOpDeDupProc optimization is applied, there are three select operators in a row before the FileSink operator. NonBlockingOpDeDupProc tries to deduplicate these select operators, and the cast of _col1 to int before writing to the file is lost during the deduplication process. More precisely, backtracking cSELExprNodeDesc fails because the columnExprMap is missing:
{code}
ExprNodeDesc newPSELExprNodeDesc =
                ExprNodeDescUtils.backtrack(cSELExprNodeDesc, cSEL, pSEL);
{code}
I tried including the columnExprMap in SemanticAnalyzer#genConversionSelectOperator, and with that change the test case passes.
Please correct me if I'm wrong.
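
To illustrate the failure mode described above, here is a toy sketch in plain Java. It is not Hive code: ToySelect and this backtrack method are hypothetical stand-ins, and the fallback of returning the bare column when no mapping exists is an assumption made for illustration, not a statement about how ExprNodeDescUtils.backtrack is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are not Hive's operator classes.
class SelectBacktrackSketch {

    /** A select operator reduced to its output-column -> expression mapping (may be null). */
    static final class ToySelect {
        final Map<String, String> columnExprMap; // null models the missing map

        ToySelect(Map<String, String> columnExprMap) {
            this.columnExprMap = columnExprMap;
        }
    }

    /**
     * Resolves an output column of sel to the expression that produces it.
     * Assumption: without a mapping the column name is passed through unchanged,
     * which is where a cast defined by sel would silently disappear.
     */
    static String backtrack(String column, ToySelect sel) {
        if (sel.columnExprMap == null || !sel.columnExprMap.containsKey(column)) {
            return column;
        }
        return sel.columnExprMap.get(column);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("_col1", "CAST(_col1 AS INT)");

        // With the map populated, merging the selects keeps the cast.
        System.out.println(backtrack("_col1", new ToySelect(map)));  // prints CAST(_col1 AS INT)
        // With no map, only the bare column survives the merge.
        System.out.println(backtrack("_col1", new ToySelect(null))); // prints _col1
    }
}
```

In this model, populating the map on the conversion select is what lets the merged select keep the cast, which matches the change I tried.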

> Dynamic partition sort optimization inserts record wrongly to partition when used with GroupBy
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8151
>                 URL: https://issues.apache.org/jira/browse/HIVE-8151
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.14.0, 0.13.1
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8151.1.patch, HIVE-8151.2.patch, HIVE-8151.3.patch, HIVE-8151.4.patch, HIVE-8151.5.patch, HIVE-8151.6.patch, HIVE-8151.7.patch, HIVE-8151.8.patch
>
>
> HIVE-6455 added the dynamic partition sort optimization. It added a startGroup() method to the FileSink operator to look for changes in the reduce key when creating partition directories. This method, however, is not reliable, because the key passed to startGroup() differs from the key passed to processOp(): startGroup() is called with the newly changed key, whereas processOp() is called with the previously aggregated key. As a result, processOp() writes the last row of the previous group as the first row of the next group. This happens only when the optimization is used with a group by operator.
> The fix is to not rely on startGroup() and to do the partition directory creation in processOp() itself.
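
The key mismatch described in the issue can be reproduced with a toy model in plain Java. This is a sketch only: ToySink and the run method are hypothetical names, not Hive's FileSinkOperator or GroupByOperator, and the event order (the sink sees the new key before the previous group's aggregate arrives) is taken from the description above rather than from the Hive source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not Hive's FileSinkOperator or GroupByOperator.
class StartGroupSketch {

    /** Sink that picks the partition either in startGroup() or from each row. */
    static final class ToySink {
        final boolean relyOnStartGroup;
        final Map<String, List<String>> written = new LinkedHashMap<>();
        String currentPartition;

        ToySink(boolean relyOnStartGroup) {
            this.relyOnStartGroup = relyOnStartGroup;
        }

        void startGroup(String newKey) {
            if (relyOnStartGroup) {
                currentPartition = newKey;
            }
        }

        void processOp(String rowKey, String row) {
            // The fixed variant derives the partition from the row it is writing.
            String partition = relyOnStartGroup ? currentPartition : rowKey;
            written.computeIfAbsent(partition, k -> new ArrayList<>()).add(row);
        }
    }

    /** Feeds sorted keys through a toy group-by that emits a group only after the next key starts. */
    static Map<String, List<String>> run(List<String> sortedKeys, boolean relyOnStartGroup) {
        ToySink sink = new ToySink(relyOnStartGroup);
        String previousKey = null;
        int count = 0;
        for (String key : sortedKeys) {
            if (!key.equals(previousKey)) {
                sink.startGroup(key); // the sink already sees the new key
                if (previousKey != null) {
                    // The aggregate of the previous group arrives afterwards.
                    sink.processOp(previousKey, previousKey + ":" + count);
                }
                previousKey = key;
                count = 0;
            }
            count++;
        }
        if (previousKey != null) {
            sink.processOp(previousKey, previousKey + ":" + count);
        }
        return sink.written;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b");
        System.out.println(run(keys, true));  // prints {b=[a:2, b:1]}
        System.out.println(run(keys, false)); // prints {a=[a:2], b=[b:1]}
    }
}
```

With relyOnStartGroup set to true, the aggregate for key a lands in partition b. Deriving the partition inside processOp places each row under its own key.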



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)