You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Jiang Xingbo (JIRA)" <ji...@apache.org> on 2018/01/24 23:30:00 UTC

[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

    [ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338419#comment-16338419 ] 

Jiang Xingbo commented on SPARK-23207:
--------------------------------------

I'm working on this.

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---------------------------------------------------------------
>
>                 Key: SPARK-23207
>                 URL: https://issues.apache.org/jira/browse/SPARK-23207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Jiang Xingbo
>            Priority: Major
>
> Currently shuffle repartition uses RoundRobinPartitioning, the generated result is nondeterministic since the sequence of input rows are not determined.
> The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below:
> upstream stage -> repartition stage -> result stage
> (-> indicate a shuffle)
> When one of the executors process goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data.
> The following code returns 931532, instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org