Posted to issues@spark.apache.org by "Yuanjian Li (Jira)" <ji...@apache.org> on 2019/08/20 04:10:00 UTC

[jira] [Comment Edited] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

    [ https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905250#comment-16905250 ] 

Yuanjian Li edited comment on SPARK-28699 at 8/20/19 4:09 AM:
--------------------------------------------------------------

-The current [approach|https://github.com/apache/spark/pull/25420] is just a band-aid fix for the wrong-answer problem.-

After further investigation, we found that this bug has nothing to do with the cache operation. So we focused on the sort + shuffle path itself and finally found that the root cause is incorrect usage of radix sort.

In the original logic, we enable radix sort based only on the config and use it for the binary data comparison. That may be fine when the dataset has only a single numeric column, but in this case the binary data produced by the "map\{ x => (x % 1000, x)}" transformation can't be sorted correctly by radix sort.
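
A minimal sketch of this point in plain Scala (toy code, not Spark's UnsafeExternalSorter or its radix sort; the values are invented for illustration): a prefix-only comparison leaves records that tie on the prefix in arrival order, so two runs that see the same records in different orders produce different "sorted" output, while comparing the full record is deterministic in both runs.
{code:scala}
// Toy illustration: "prefix-only" sort (what radix sort effectively compares here)
// versus full-record sort, over two-column records (key, value).
object PrefixSortDemo {
  def main(args: Array[String]): Unit = {
    val runA = Seq((1L, 7L), (1L, 3L), (2L, 5L))
    val runB = Seq((1L, 3L), (1L, 7L), (2L, 5L)) // same records, different arrival order

    // Sorting by the prefix (first column) alone: ties keep arrival order,
    // so the two runs end up with different sequences.
    println(runA.sortBy(_._1)) // List((1,7), (1,3), (2,5))
    println(runB.sortBy(_._1)) // List((1,3), (1,7), (2,5))

    // Sorting by the full record gives the same order in both runs.
    println(runA.sorted)       // List((1,3), (1,7), (2,5))
    println(runB.sorted)       // List((1,3), (1,7), (2,5))
  }
}
{code}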

After the fix in [https://github.com/apache/spark/pull/25491], all tests passed with the correct answer.

Also, a corner case in the DAGScheduler found during testing is fixed separately in [https://github.com/apache/spark/pull/25491].

After we finish the work on indeterminate stage rerun (SPARK-25341), we can fix this by unpersisting the original RDD and rerunning the cached indeterminate stage. A preview codebase is [here|https://github.com/xuanyuanking/spark/tree/SPARK-28699-RERUN].
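
For illustration only, a hedged sketch of that idea against the public RDD API (`rerunCachedIndeterminate` is a hypothetical helper name; the actual fix would live in the DAGScheduler rather than in user code):
{code:scala}
import org.apache.spark.rdd.RDD

// Sketch of the idea: if the cached output of an indeterminate RDD has to be
// recomputed, drop ALL cached partitions first and then recompute the whole
// lineage in one pass, so stale and freshly computed partitions are never mixed.
def rerunCachedIndeterminate[T](rdd: RDD[T]): Unit = {
  rdd.unpersist(blocking = true) // discard every cached partition
  rdd.persist()                  // mark it for caching again
  rdd.count()                    // force a full, consistent recomputation
}
{code}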


was (Author: xuanyuan):
The current [approach|https://github.com/apache/spark/pull/25420] is just a band-aid fix for the wrong-answer problem.

After we finish the work on indeterminate stage rerun (SPARK-25341), we can fix this by unpersisting the original RDD and rerunning the cached indeterminate stage. A preview codebase is [here|https://github.com/xuanyuanking/spark/tree/SPARK-28699-RERUN].

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28699
>                 URL: https://issues.apache.org/jira/browse/SPARK-28699
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Yuanjian Li
>            Priority: Major
>              Labels: correctness
>
> Related to SPARK-23207 and SPARK-23243.
> It's another case where an indeterminate stage/RDD produces a wrong result when a stage rerun happens. In the CachedRDDBuilder, we miss propagating the `isOrderSensitive` characteristic to the newly created MapPartitionsRDD (see the illustrative sketch after the reproduction code below).
> We can reproduce this with the following code; thanks to Tyson for reporting this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
>
> val res = spark.range(0, 10000 * 10000, 1).map { x => (x % 1000, x) }
> // Kill an executor in the stage that performs repartition(239).
> val df = res.repartition(113).cache.repartition(239).map { x =>
>   if (TaskContext.get.attemptNumber == 0 &&
>       TaskContext.get.partitionId < 1 &&
>       TaskContext.get.stageAttemptNumber == 0) {
>     throw new Exception("pkill -f -n java".!!)
>   }
>   x
> }
> val r2 = df.distinct.count()
> {code}
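>
> As referenced above, a minimal sketch in plain Scala of why the repartition step is order sensitive (toy code, not Spark internals; `roundRobin` is a hypothetical stand-in for round-robin repartitioning): where each row lands depends on the arrival order of the input, so a rerun that sees the rows in a different order distributes them differently.
> {code:scala}
> // Toy stand-in only: round-robin placement depends on input order.
> object OrderSensitiveDemo {
>   def roundRobin[T](rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
>     rows.zipWithIndex
>       .groupBy { case (_, i) => i % numPartitions }
>       .map { case (p, rs) => p -> rs.map(_._1) }
>
>   def main(args: Array[String]): Unit = {
>     val firstRun = Seq("a", "b", "c", "d")
>     val rerun    = Seq("b", "a", "c", "d") // same rows, different arrival order after a rerun
>
>     println(roundRobin(firstRun, 2)(0)) // List(a, c)
>     println(roundRobin(rerun, 2)(0))    // List(b, c) -- same partition, different rows
>   }
> }
> {code}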



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org