Posted to reviews@spark.apache.org by mridulm <gi...@git.apache.org> on 2018/08/17 16:50:02 UTC

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210963665
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
       // From performance concern, cache the value to avoid repeatedly compute `isBarrier()` on a long
       // RDD chain.
       @transient protected lazy val isBarrier_ : Boolean = dependencies.exists(_.rdd.isBarrier())
    +
    +  /**
    +   * Whether the RDD's computing function is idempotent. Idempotent means the computing function
    +   * not only satisfies the requirement, but also produce the same output sequence(the output order
    +   * can't vary) given the same input sequence. Spark assumes all the RDDs are idempotent, except
    +   * for the shuffle RDD and RDDs derived from non-idempotent RDD.
    +   */
    --- End diff --
    
    This means every RDD that reads, directly or indirectly, from unsorted shuffle output is not 'idempotent'.
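
To make the transitive effect concrete, here is a minimal sketch in plain Scala, with no Spark dependency. The names `Node`, `Source`, `Narrow`, `ShuffleRead`, and `Lineage.isIdempotent` are hypothetical and are not part of the Spark API or of this PR. The sketch also assumes that a sorted shuffle read yields a fixed output order, which is a simplification made only for illustration.

```scala
// Hypothetical toy model of an RDD lineage; none of these names are Spark APIs.
sealed trait Node { def parents: Seq[Node] }

// A leaf input with no parents, assumed to produce a stable output order.
case class Source(name: String) extends Node { val parents: Seq[Node] = Nil }

// A narrow transformation (map, filter, ...) that preserves its input order.
case class Narrow(name: String, parents: Seq[Node]) extends Node

// A shuffle read; `sorted` is an assumed flag saying the output order is fixed.
case class ShuffleRead(name: String, sorted: Boolean, parents: Seq[Node]) extends Node

object Lineage {
  // A node counts as idempotent unless it is an unsorted shuffle read
  // or any of its ancestors is non-idempotent.
  def isIdempotent(node: Node): Boolean = node match {
    case ShuffleRead(_, sorted, ps) => sorted && ps.forall(isIdempotent)
    case other                      => other.parents.forall(isIdempotent)
  }
}
```

In this model, a `Narrow` node stacked on another `Narrow` node that sits on an unsorted `ShuffleRead` is still reported as non-idempotent, however long the chain, which is the breadth of impact the comment above points out.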

