Posted to issues@spark.apache.org by "Kent Yao (Jira)" <ji...@apache.org> on 2020/12/04 06:18:00 UTC
[jira] [Comment Edited] (SPARK-24607) Distribute by rand() can lead to data inconsistency

    [ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243743#comment-17243743 ] 

Kent Yao edited comment on SPARK-24607 at 12/4/20, 6:17 AM:
------------------------------------------------------------

This can happen when the map stage is retried: the same record in a map task may be sent to different reduce tasks across task attempts.

This can produce an incomplete result set when a non-deterministic expression, e.g. rand(), is used in jobs that need a shuffle, e.g. aggregates or sort-merge joins.
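
To make the failure mode concrete, here is a toy simulation in plain Python (not Spark code). It assumes two map tasks, two reducers, and that each task attempt draws fresh random values; all names are hypothetical and the model says nothing about Spark's actual scheduler internals. Reducer 0 fetches its input, then map task 0 loses its output and is rerun, then reducer 1 fetches.

```python
import random
import zlib

NUM_REDUCERS = 2

def run_map_task(records, partitioner):
    """Split one map task's records into per-reducer buckets."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for rec in records:
        buckets[partitioner(rec)].append(rec)
    return buckets

def random_partitioner(attempt):
    # Assumption of this sketch: every attempt sees a different random sequence.
    rng = random.Random(attempt)
    return lambda rec: rng.randrange(NUM_REDUCERS)

def hash_partitioner(attempt):
    # Depends only on the record, so every attempt routes it the same way.
    return lambda rec: zlib.crc32(str(rec).encode()) % NUM_REDUCERS

def simulate(partitioner_factory):
    """Return all records the reducers receive when map task 0 is retried
    after reducer 0 has already fetched its input."""
    map_inputs = [list(range(0, 100)), list(range(100, 200))]
    # First attempt of every map task.
    outputs = [run_map_task(recs, partitioner_factory(0)) for recs in map_inputs]
    # Reducer 0 fetches its bucket from every map output and finishes.
    seen = [rec for out in outputs for rec in out[0]]
    # Map task 0's output is lost, so the task is rerun as a new attempt.
    outputs[0] = run_map_task(map_inputs[0], partitioner_factory(1))
    # Reducer 1 fetches from the map outputs that exist now.
    seen += [rec for out in outputs for rec in out[1]]
    return seen

if __name__ == "__main__":
    for name, factory in [("rand", random_partitioner), ("hash", hash_partitioner)]:
        seen = simulate(factory)
        missing = 200 - len(set(seen))
        duplicated = len(seen) - len(set(seen))
        print(f"{name}: received={len(seen)} missing={missing} duplicated={duplicated}")
```

With the random partitioner, a record that went to reducer 1 on the first attempt and to reducer 0 on the retry is never read by anyone, and the opposite case is read twice. With the record-based partitioner the retry reproduces the same buckets, so every record is seen exactly once.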

We may need a random but replayable function to handle these use cases, because distributing by rand() is a common way for users to deal with data skew.

Otherwise, we may need to forbid non-deterministic functions in shuffle-related operations.
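
One possible shape for a "random but replayable" function, sketched in plain Python rather than as a Spark API: derive the value from a fixed seed plus something that identifies the record, so a retried task recomputes the same number. The names replayable_rand and target_partition, the default seed, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

def replayable_rand(record_key, seed=42):
    """Pseudo-random float in [0, 1) that depends only on (seed, record_key)."""
    digest = hashlib.sha256(f"{seed}:{record_key}".encode()).digest()
    # Keep 53 bits so the result is exactly representable and strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def target_partition(record_key, num_partitions, seed=42):
    """Map a record to a partition; identical on every task attempt."""
    return int(replayable_rand(record_key, seed) * num_partitions)

if __name__ == "__main__":
    for key in ["row-1", "row-2", "row-3"]:
        print(key, target_partition(key, 8))
```

The trade-off in this sketch is that rows with the same key always land in the same partition, so record_key would need to include a per-row unique field to spread a single hot key the way rand() does.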

cc [~cloud_fan] [~ulysses] [~maropu]


> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
>              Labels: bulk-closed
>
> Noticed the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with HiveSQL that includes distribute by rand(); data inconsistency may occur during fault-tolerance operations. Since Spark has a similar fault-tolerance mechanism, I think this is also a serious hidden problem in Spark SQL.


