Posted to issues@spark.apache.org by "Kent Yao (Jira)" <ji...@apache.org> on 2020/12/04 06:22:00 UTC

[jira] [Updated] (SPARK-24607) Distribute by rand() can lead to data inconsistency

     [ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-24607:
-----------------------------
    Labels: bulk-closed correctness  (was: bulk-closed)

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
>              Labels: bulk-closed, correctness
>
> Noticed the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build a cube with Hive SQL that includes distribute by rand(); data inconsistency may occur during fault-tolerance operations such as task retries. Since Spark has a similar fault-tolerance mechanism, I think this is also a serious hidden problem in Spark SQL.
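
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is a toy model, not Spark's or Hive's actual scheduler: the function names (run_map_attempt, run_map_attempt_by_key, merge_reducers) are hypothetical, and it assumes that a retried map attempt draws different random values than the first attempt (for example because of a different seed or a different input row order). Reducer 0 reads the output of the first attempt, the map output is then lost, and reducer 1 reads the output of the retried attempt.

```python
import random


def run_map_attempt(rows, num_partitions, seed):
    """Toy map task: assign each row to a shuffle partition at random."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[rng.randrange(num_partitions)].append(row)
    return buckets


def run_map_attempt_by_key(rows, num_partitions):
    """Deterministic alternative: the partition depends only on the row."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[row % num_partitions].append(row)
    return buckets


def merge_reducers(first_attempt, retried_attempt):
    """Reducer 0 saw the first attempt, reducer 1 sees the retried one."""
    return sorted(first_attempt[0] + retried_attempt[1])


rows = list(range(1000))

# Random distribution: the two attempts disagree on where rows belong.
first = run_map_attempt(rows, 2, seed=1)
retry = run_map_attempt(rows, 2, seed=2)  # assumed: the retry draws new values
random_result = merge_reducers(first, retry)

# Rows that moved from partition 1 to partition 0 are read by nobody.
lost = set(first[1]) & set(retry[0])
# Rows that moved from partition 0 to partition 1 are read twice.
duplicated = set(first[0]) & set(retry[1])

# Deterministic distribution: both attempts produce identical buckets.
key_result = merge_reducers(
    run_map_attempt_by_key(rows, 2),
    run_map_attempt_by_key(rows, 2),
)

print("input rows:          ", len(rows))
print("rows after rand():   ", len(random_result))
print("lost / duplicated:   ", len(lost), "/", len(duplicated))
print("rows after key-based:", len(key_result))
```

In this model the key-based variant returns exactly the input rows, because a retried attempt places every row in the same partition as before. That suggests distributing by a value derived from the row itself rather than by rand(), although whether this fully avoids the problem in a real cluster depends on details the sketch does not cover.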



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org