Posted to issues@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2018/06/20 06:41:00 UTC

[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

    [ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517846#comment-16517846 ] 

Rui Li commented on HIVE-19671:
-------------------------------

[~xuefuz], thanks for your input. I think rand(seed) may not work if the mapper's input is not read in a deterministic order. As an example, suppose a mapper needs to process keys {{1, 2, 3, 4, 5}}. The partitioning in the 1st attempt is as below:
{noformat}
key   rand(seed)
 1  ->   1
 2  ->   2
 3  ->   3
 4  ->   4
 5  ->   5
{noformat}
So 5 reducers will fetch data from this mapper. Suppose the first 4 reducers have finished, and when the 5th reducer starts, the node hosting the mapper's output is lost, so the mapper is rerun. The 2nd attempt has the following partitioning:
{noformat}
key   rand(seed)
 1  ->   1
 3  ->   2
 5  ->   3
 2  ->   4
 4  ->   5
{noformat}
Then the 5th reducer is rerun and fetches key 4 instead of key 5. Since the 4th reducer already fetched key 4 from the 1st attempt, key 4 is duplicated and key 5 is lost.
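
The scenario above can be replayed with a small Python sketch. This is only an illustration, not Hive code: the list {{SEEDED_DRAWS}} and the function {{partition}} are hypothetical names, and the fixed list stands in for the values rand(seed) would produce, under the assumption that the i-th record read always receives the i-th value.

```python
# Hypothetical stand-in for the sequence produced by rand(seed):
# the i-th record read gets the i-th value, whichever key it holds.
SEEDED_DRAWS = [1, 2, 3, 4, 5]

def partition(keys_in_read_order):
    """Return {reducer: [keys]} for one mapper attempt."""
    out = {}
    for key, reducer in zip(keys_in_read_order, SEEDED_DRAWS):
        out.setdefault(reducer, []).append(key)
    return out

attempt1 = partition([1, 2, 3, 4, 5])  # read order of the 1st attempt
attempt2 = partition([1, 3, 5, 2, 4])  # read order of the 2nd attempt

# Reducers 1-4 fetched from the 1st attempt before the node was lost;
# reducer 5 fetches from the 2nd attempt.
received = []
for reducer in range(1, 6):
    source = attempt1 if reducer <= 4 else attempt2
    received.extend(source.get(reducer, []))

print(sorted(received))  # [1, 2, 3, 4, 4]
```

The combined reducer input contains key 4 twice and never contains key 5, even though both attempts used the same sequence of values.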

To avoid the issue, we need to make sure the record reader guarantees an order when reading data from HDFS, and that we don't use a shuffle that doesn't order the keys, e.g. Spark's groupByKey. What do you think?
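
As a rough sketch of why a guaranteed read order helps (again in Python, not Hive code): below, sorting the keys is an assumed stand-in for a record reader that always returns records in the same order, and {{partition}}, the seed value 42 and the reducer count are hypothetical choices for illustration.

```python
import random

def partition(keys_in_read_order, seed=42, num_reducers=5):
    """Assign each key to a reducer, drawing from a generator seeded per attempt."""
    rng = random.Random(seed)  # every attempt starts from the same seed
    return {key: rng.randrange(num_reducers) for key in keys_in_read_order}

first_read = [1, 2, 3, 4, 5]   # order seen by the 1st attempt
second_read = [1, 3, 5, 2, 4]  # order seen by the 2nd attempt

# With a fixed order (modelled here by sorting), both attempts agree on
# every key's reducer, so a rerun mapper reproduces the same partitions.
stable = partition(sorted(first_read)) == partition(sorted(second_read))
print(stable)
```

Without the fixed order, nothing forces the two attempts to give a key the same reducer, which is the failure described above.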

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: HIVE-19671
>                 URL: https://issues.apache.org/jira/browse/HIVE-19671
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>            Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}


