Posted to reviews@spark.apache.org by tgravescs <gi...@git.apache.org> on 2018/08/07 19:43:59 UTC

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    Sorry for coming in late on this; I first saw it the other day.
    
    Could someone summarize the discussion here: exactly when does this happen, and why? Checkpointing was mentioned as a workaround; why does it help? It would be good to add those details to the jira in any case.
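    
    For reference, I assume the checkpoint workaround people are talking about is just cutting the lineage before the repartition, so a fetch failure re-reads the checkpointed data instead of recomputing a parent whose output order may differ. Something roughly like this (my sketch of the idea, not code from the discussion; the checkpoint path is a placeholder and `sc` is the usual spark-shell context):
    
    ```
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    
    val parent = sc.parallelize(0 until 1000000, 1).repartition(200)
    // Write the shuffled parent to reliable storage so that a later fetch
    // failure replays exactly the same rows in the same order, rather than
    // recomputing them through a non-deterministic shuffle.
    parent.checkpoint()
    parent.count()  // run a job so the checkpoint actually gets written
    
    parent.repartition(200).distinct().count()
    ```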
    
    My initial reaction is that this is very bad. Any correctness issue we cause while handling failures is not something we should write off and expect the user to deal with.
    repartition seems to be the most obvious case, and I know lots of people use it (although hopefully many are using the DataFrame API). We see fetch failures on large jobs all the time, so this seems really serious.
    
    Trying a similar example to the one listed in jira SPARK-23207, but with an RDD, doesn't reproduce this:
    
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    
    // Shuffle twice, then on the first attempt of tasks 0 and 1 downstream of
    // the second repartition, kill the java processes on that node; losing the
    // executor (and its shuffle output) should surface fetch failures and force
    // the upstream stages to be rerun.
    val res = sc.parallelize(0 to (1000000-1), 1).repartition(200).map { x =>
      x
    }.repartition(200).map { x =>
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    res.distinct().count()
    ```
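    
    For context on where the non-determinism comes from, my understanding is that `repartition()` assigns each record a target partition round-robin, starting from an offset derived from the input partition index. A rough paraphrase of that shape (not the actual Spark source) is:
    
    ```
    import scala.util.Random
    
    // Rough sketch of the per-partition round-robin assignment done by
    // RDD.repartition(numPartitions): the starting offset is a function of
    // the input partition index, but which record lands where depends on
    // the iteration order of the input.
    def assignTargets[T](partitionIndex: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
      var position = new Random(partitionIndex).nextInt(numPartitions)
      items.map { item =>
        position += 1
        (position % numPartitions, item)
      }
    }
    ```
    
    If that's right, the starting offset is deterministic but the per-record assignment is order-dependent, so when a lost parent partition is recomputed and an upstream shuffle returns its rows in a different order, records land in different output partitions than in the first attempt, which would be where the duplicates and losses come from.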


---
