Posted to issues@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2018/05/23 12:15:00 UTC

[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

    [ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487148#comment-16487148 ] 

Rui Li commented on HIVE-19671:
-------------------------------

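I haven't verified it, but my guess is that the issue happens with task failover. Suppose the mappers of the {{distribute by}} stage finish successfully. The reducers then start, but some of them fail to fetch shuffle data because nodes hosting mapper output are lost, so those mappers are retried. Since the partition keys are randomly generated, a retried mapper can assign rows to different partitions than its previous attempt did. Reducers that read the first attempt's output and reducers that read the retry's output then see inconsistent partitioning, so some rows can be read twice and others not at all, which leads to the inconsistency.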
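
The guessed failover scenario can be sketched with a toy simulation. This is not Hive code and does not model Hive's actual shuffle; the names ({{run_mapper}}, {{simulate}}, the two partitioner helpers) are hypothetical, and it assumes one mapper, two reducers, and that reducer 0 reads the first map attempt while reducer 1 reads the retry.

```python
import random
from collections import Counter

NUM_REDUCERS = 2  # assumption: a two-reducer job, chosen only for illustration


def run_mapper(rows, partition_fn):
    """Split one mapper's rows into per-reducer shuffle buckets."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        buckets[partition_fn(row)].append(row)
    return buckets


def simulate(rows, make_partition_fn):
    """Return the rows the reducers receive when the mapper is retried.

    make_partition_fn(attempt) builds the partitioner used by that attempt.
    """
    first_attempt = run_mapper(rows, make_partition_fn(1))
    # Reducer 0 fetches its bucket before the mapper's node is lost.
    reducer0 = first_attempt[0]
    # The node is lost, the mapper is rerun, and reducer 1 reads the new output.
    retry = run_mapper(rows, make_partition_fn(2))
    reducer1 = retry[1]
    return reducer0 + reducer1


def random_partitioner(attempt):
    # Stand-in for "distribute by rand()": each attempt draws a new sequence.
    rng = random.Random(attempt)
    return lambda row: rng.randrange(NUM_REDUCERS)


def deterministic_partitioner(attempt):
    # Stand-in for distributing by a value derived from the row itself.
    return lambda row: row % NUM_REDUCERS


rows = list(range(1000))
random_result = simulate(rows, random_partitioner)
stable_result = simulate(rows, deterministic_partitioner)

duplicated = sum(1 for n in Counter(random_result).values() if n > 1)
lost = len(set(rows) - set(random_result))
print("input rows:", len(rows))
print("random partitioning:", len(random_result),
      "rows, duplicated:", duplicated, "lost:", lost)
print("deterministic partitioning:", len(stable_result), "rows")
```

In this model a row is read twice when the first attempt sends it to reducer 0 and the retry sends it to reducer 1, and it is dropped in the opposite case. The row count is off whenever the number of duplicated rows differs from the number of lost rows. With a partitioner that depends only on the row, every attempt produces the same buckets, so the retry is harmless.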

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: HIVE-19671
>                 URL: https://issues.apache.org/jira/browse/HIVE-19671
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand());
> {code}


