You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by CodingCat <gi...@git.apache.org> on 2014/03/25 19:46:47 UTC

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

GitHub user CodingCat opened a pull request:

    https://github.com/apache/spark/pull/228

    SPARK-732: eliminate duplicate update of the accumulator

    https://spark-project.atlassian.net/browse/SPARK-732?filter=-1
    
    In current implementation, the accumulator will be updated for every successfully finished task, even the task is from a resubmitted stage, which makes the accumulator counter-intuitive
    
    In this patch, I changed the way for the DAGScheduler to update the accumulator, 
    
    DAGScheduler maintains a HashTable, mapping the stage id to the received <accumulator_id , value> pairs. Only when the stage becomes independent, (no job needs it any more), we accumulate the values of the <accumulator_id , value>  pairs, in this way, even the stage is resubmitted, we check if the HashTable has contained such stageId, it refuses to record the pairs. 
    
    ---
    
    
    However, the existence of the duplicate update is also related to the cache, especially when we make some accumulator operation in the transformation....(the original JIRA also discussed this)
    
    e.g. 
    
    ```
    val a = sc.parallelize(List(1, 2, 3, 4))
    val b = a.map(i => {accum += 1; i + 1})
    val c = b.filter(i => i < 4)
    c.count() - accum is 4
    c.count() - accum is 8
    b.cache()
    c.count() - accum is still 8
    ```
    
    My question here is whether we should take care of the update in the transformation, or we assume that the user should be aware of the possible duplicate update 
    
    Glad to hear your advice

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/spark SPARK-732

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #228
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949097
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -831,7 +840,13 @@ class DAGScheduler(
                         activeJobs -= job
                         resultStageToJob -= stage
                         markStageAsFinished(stage)
    -                    jobIdToStageIdsRemove(job.jobId)
    +                    jobIdToStageIdsRemove(job.jobId).foreach(stageId =>  {
    +                      //accumulator operations
    +                      for (accumValues <- stageIdToAccumulators(stageId)) {
    +                        Accumulators.add(accumValues._1, accumValues._2)
    +                      }
    +                      stageIdToAccumulators -= stageId
    +                    })
    --- End diff --
    
    Push this into another cleanup method - so that we can leverage it for other operations in future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949250
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -925,6 +939,10 @@ class DAGScheduler(
             }
             failedStages += failedStage
             failedStages += mapStage
    +        if (runningStages.contains(failedStage)) {
    +          stageIdToAccumulators -= failedStage.id
    +        }
    +        runningStages -= failedStage
    --- End diff --
    
    if runningStages -= is executed above, the if condition will always return false.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39235975
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38734617
  
    I believe you still need to handle the DAGScheduler cases of: 1) locally run jobs (and their on-completion cleanup); and 2) aborted stages.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39201633
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38643347
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13450/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10948883
  
    --- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
    @@ -237,12 +240,7 @@ private object Accumulators {
       // TODO: Use soft references? => need to make readObject work properly then
       val originals = Map[Long, Accumulable[_, _]]()
       val localAccums = Map[Thread, Map[Long, Accumulable[_, _]]]()
    -  var lastId: Long = 0
    -
    -  def newId: Long = synchronized {
    -    lastId += 1
    -    lastId
    -  }
    +  var nextAccumID = new AtomicLong(0)
    --- End diff --
    
    ```
    private val nextAccumID = new AtomicLong(0)
    def newId(): Long = nextAccumID.getAndIncrement()
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39620404
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10995808
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -405,12 +407,14 @@ class DAGScheduler(
         independentStages.toSet
       }
     
    -  private def jobIdToStageIdsRemove(jobId: Int) {
    +  private def jobIdToStageIdsRemove(jobId: Int): Set[Int] = {
         if (!jobIdToStageIds.contains(jobId)) {
    -      logDebug("Trying to remove unregistered job " + jobId)
    +      logWarning("Trying to remove unregistered job " + jobId)
    +      return Set[Int]()
         } else {
    -      removeJobAndIndependentStages(jobId)
    +      val removedStages = removeJobAndIndependentStages(jobId)
           jobIdToStageIds -= jobId
    +      return removedStages
    --- End diff --
    
    Neither `return` is necessary, since the whole function is just `if (!jobIdToStageIds.contains(jobId)) Set[Int]() else removedStages`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39641216
  
    Hi, @pwendell , Thank you for your comments, here is my reply
    
    First, "whether accumulator value should be subject to change in the case of failure"...I think no, though during a long period, Spark runs in this way (this patch is actually resolving a very old TODO in DAGScheduler.scala)...I think accumulator is usually used to take the task which is more complicate than rdd.count/rdd.filter.count (Maybe I'm wrong), e.g. counting the sum of distance from the points to a potential center in K-means (see mllib), I think in this case, the health status of the cluster should be transparent to the user, i.e. the final result of K-means should be irrelevant to whether executor is lost, etc....
    
    Second, Good point, I can understand what the use scenario is, but do you mind providing more details like how to implement this in Spark? I guess this can be solved by providing a approximateValue API in Accumulator or SparkContext....
    
    Third, actually, this patch ensures that the value of the accumulator in a stage will only be available when this stage becomes independent (means that no job needs it any more), if a job finishes, and the other job still needs the certain stage, the accumulator value calculated in the stage will not be counted...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38988019
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13567/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39004142
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39620888
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39620408
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949224
  
    --- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
    @@ -277,4 +275,10 @@ private object Accumulators {
           }
         }
       }
    +
    +  def add(value: (Long, Any)): Unit = synchronized {
    +    if (originals.contains(value._1)) {
    +      originals(value._1).asInstanceOf[Accumulable[Any, Any]] += value._2
    +    }
    +  }
     }
    --- End diff --
    
    There is another add method which takes in a Map of values ... I guess that should be removed as part of this PR ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10948812
  
    --- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
    @@ -43,8 +44,10 @@ class Accumulable[R, T] (
         param: AccumulableParam[R, T])
       extends Serializable {
     
    -  val id = Accumulators.newId
    -  @transient private var value_ = initialValue // Current value on master
    +  val id = Accumulators.nextAccumID.get()
    +  Accumulators.nextAccumID.getAndIncrement
    --- End diff --
    
    Ok, this is buggy (parallel invocations to Accumulators.nextAccumID.get() return same value).
    Instead of exposing nextAccumID - make that an implementation detail of Accumulators.
    Keep the code same - and rework newId in Accumators to use nextAccumID.getAndIncrement()


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39252199
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38881843
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39214800
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13631/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39285448
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39003471
  
    looks good @CodingCat ! just made a few minor points.
    excellent change !!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38607582
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39217332
  
    it's weird, in my mbp, all test cases are passed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38605061
  
    Hi, @JoshRosen mind reviewing this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10996813
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -831,7 +842,8 @@ class DAGScheduler(
                         activeJobs -= job
                         resultStageToJob -= stage
                         markStageAsFinished(stage)
    -                    jobIdToStageIdsRemove(job.jobId)
    +                    val stagesToRemove = jobIdToStageIdsRemove(job.jobId)
    +                    cleanup(stagesToRemove.asInstanceOf[Set[Any]])
    --- End diff --
    
    What `cleanup` is doing should either be moved within the `removeStage` method within `removeJobAndIndependentStages`, or that `removeStage` should be go into the outer scope and be combined with `cleanup`.  Personally, I don't much care for a generically named function like `cleanup` at the outer scope.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39621045
  
    can anyone help to retrigger the test? Jenkins failed to fetch remote repo


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-52419470
  
    How about record the `(stageId,partitionId)` in `Accumulable `?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39007448
  
    Hi, Kay, I will think about it, and see if we can move accumulator related functionalities to tm entirely.   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39207085
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39203409
  
    Hi, @mridulm I have modified code according to your suggestion, thank you!
    
    Hi, @kayousterhout , I thought about your suggestion. I totally agree on that the DAGScheduler does not need to know so many details about the task, in the current implementation, some code need to be refactored; but there are some difficulties to do it in TM
    
    For handling duplicate accumulator operation, I think TM is a good candidate to handle task-level duplications, i.e. speculative task case. However, some duplication comes from higher level (e.g. resubmitted stages) TaskScheduler create a new TM for every submitted stage, no matter it is a new one or a resubmitted one, and TM itself doesn't know what it is serving for, new one or resubmitted one.  
    
    Furthermore, to deduplicate the accumulation, we actually need the dependency information which is maintained by DAGScheduler; Here is an example (in the first test case I wrote in this patch)
    
    stage 0 depends on stage 1 and stage 2, we cannot accumulate the accumulators in stage 1 and stage 2 even they are finished, otherwise, if stage 0 is failed due to a FetchFailed for stage 2's result, stage 2 will be resubmitted, and then the accumulator in stage 2's tasks will be calculated for twice. 
    
    Instead, we can only calculate the accumulator value of tasks in a stage when we are sure about that the stage is successfully finished and not needed by any jobs (And this is what I did in the current implementation)
    
    So, I think accumulator is actually something related to dependency information and we really need the status maintained by DAGScheduler to achieve the goal
    
    Any suggestion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39283228
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39285450
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13658/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38986611
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39235976
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13634/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38987810
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13566/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38987421
  
    1. refactor the code, 
    2. clear stageIdtoAccumulator when the stage is aborted
    3. Hi, @markhamstra , I checked the code, locally running job seems not have the resubmitted chance? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39726296
  
    ping


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38642649
  
    Jenkins die again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39252176
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949032
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -814,8 +819,12 @@ class DAGScheduler(
         event.reason match {
           case Success =>
             logInfo("Completed " + task)
    -        if (event.accumUpdates != null) {
    -          Accumulators.add(event.accumUpdates) // TODO: do this only if task wasn't resubmitted
    +        if (!stageIdToAccumulators.contains(stage.id) ||
    +          stageIdToAccumulators(stage.id).size < stage.numPartitions) {
    --- End diff --
    
    Null check missing ? Or gauranteed not to be null ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39283240
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39208548
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38988018
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39258700
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38987597
  
    Yes, you're right.  Sorry about that -- I was thinking about when cleanup of DAGScheduler data structures is needed, not whether you can get to each of those points via a path relevant to this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38681013
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13469/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38636442
  
    @kayousterhout @mridulm addressed the concurrency issue in Accumulator, handle speculative case in DAGScheduler (identify the speculative task by checking if there is an earlier task ran on the same partition)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38997862
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10951648
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -814,8 +819,12 @@ class DAGScheduler(
         event.reason match {
           case Success =>
             logInfo("Completed " + task)
    -        if (event.accumUpdates != null) {
    -          Accumulators.add(event.accumUpdates) // TODO: do this only if task wasn't resubmitted
    +        if (!stageIdToAccumulators.contains(stage.id) ||
    +          stageIdToAccumulators(stage.id).size < stage.numPartitions) {
    +          stageIdToAccumulators.getOrElseUpdate(stage.id, new ListBuffer[(Long, Any)])
    +          for ((id, value) <- event.accumUpdates) {
    +            stageIdToAccumulators(stage.id) += id -> value
    --- End diff --
    
    How does this work when tasks in a stage get retried?  For example, consider a stage with with 3 tasks, where first task 1 finishes, then task 2 finishes, then a duplicate, speculated version of task 2 finishes, and then task 3 finishes.  It seems like the accumulator values from task 2 will be added twice and the values from task 3 will never be added?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38672238
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39214799
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39004143
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13579/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39201623
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39229541
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39207086
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13629/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949472
  
    --- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
    @@ -43,8 +44,10 @@ class Accumulable[R, T] (
         param: AccumulableParam[R, T])
       extends Serializable {
     
    -  val id = Accumulators.newId
    -  @transient private var value_ = initialValue // Current value on master
    +  val id = Accumulators.nextAccumID.get()
    +  Accumulators.nextAccumID.getAndIncrement
    --- End diff --
    
    I am not sure why id would be called twice - it is initialized only once.
    Also, the id is just a unique identifier - we are not trying to ensure it is conserved, but the current code in PR is MT-unsafe w.r.t uniqueness requirement for id.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39002055
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13577/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10953086
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -814,8 +819,12 @@ class DAGScheduler(
         event.reason match {
           case Success =>
             logInfo("Completed " + task)
    -        if (event.accumUpdates != null) {
    -          Accumulators.add(event.accumUpdates) // TODO: do this only if task wasn't resubmitted
    +        if (!stageIdToAccumulators.contains(stage.id) ||
    +          stageIdToAccumulators(stage.id).size < stage.numPartitions) {
    +          stageIdToAccumulators.getOrElseUpdate(stage.id, new ListBuffer[(Long, Any)])
    +          for ((id, value) <- event.accumUpdates) {
    +            stageIdToAccumulators(stage.id) += id -> value
    --- End diff --
    
    Maybe make value of stageIdToAccumulators a Map ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10948593
  
    --- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
    @@ -43,8 +44,10 @@ class Accumulable[R, T] (
         param: AccumulableParam[R, T])
       extends Serializable {
     
    -  val id = Accumulators.newId
    -  @transient private var value_ = initialValue // Current value on master
    +  val id = Accumulators.nextAccumID.get()
    +  Accumulators.nextAccumID.getAndIncrement
    --- End diff --
    
    Not sure what this is trying to do ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat closed the pull request at:

    https://github.com/apache/spark/pull/228


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38986609
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-40650644
  
    Hi, @markhamstra @mridulm @kayousterhout @pwendell  I was rethinking about this PR
    
    Sometimes user may use accumulator to measure the number of operations which were performed,  in this case, the accumulator value should be relevant to the re-execution...and for the case, like K-means in mllib, we should not make accumulator value change when the task is a resubmitted one...
    
    So shall we provide a parameter for the user to indicate what they want when they construct the accumulator?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38621311
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38624402
  
    hehe, Jenkins didn't work....I will address all of your comments , thank you @kayousterhout @mridulm 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38987809
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39000584
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39258701
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13643/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38637096
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38999183
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39620889
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13779/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r11094324
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -116,6 +116,9 @@ class DAGScheduler(
       private val metadataCleaner =
         new MetadataCleaner(MetadataCleanerType.DAG_SCHEDULER, this.cleanup, env.conf)
     
    +  // stageId -> (splitId -> (acumulatorId, accumulatorValue))
    +  val stageIdToAccumulators = new HashMap[Int, HashMap[Int, ListBuffer[(Long, Any)]]]
    --- End diff --
    
    Since this is mutable datastructure - can me make it private ?
    For testcase, expose some method which checks for the condition instead of needing to expose the entire stageIdToAccumulators ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39002548
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39285484
  
    Finally...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38681012
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38643346
  
    Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39570964
  
    @kayousterhout any feedback?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39002551
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38986828
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10948890
  
    --- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
    @@ -43,8 +44,10 @@ class Accumulable[R, T] (
         param: AccumulableParam[R, T])
       extends Serializable {
     
    -  val id = Accumulators.newId
    -  @transient private var value_ = initialValue // Current value on master
    +  val id = Accumulators.nextAccumID.get()
    +  Accumulators.nextAccumID.getAndIncrement
    --- End diff --
    
    In the original implementation, id is an integer and increased with synchronized{} guard, just make it as an AtomicLong for convenience. Here the two lines are get the value and increment 
    
    Ah, I think I know why you are confused here, the more straightforward way seems to be 
    
    ```
    val id = Accumulator.nextAccumID.getAndIncrement
    ```
    
    I tried that, but I found that the id field will be called for two times during the initilaization(?) of a task, so for every time, the id is increased...
    
    I have to get the value when call id but increment in the constructor of Accumulable


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39000586
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13573/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39004325
  
    Can we just add the accumulator update to TaskSetManager, in the handleSuccessfulTask() method?  This seems much simpler because the TaskSetManager already has all of the state about which tasks are running, which ones have been resubmitted or speculated, etc.  I think this change would be much simpler.
    
    Over time, a lot of functionality has leaked into the DAGScheduler, such that there's a lot of state that's kept in multiple places: in the DAGScheduler and in the TaskSetManager or the TaskSchedulerImpl.  The abstraction is supposed to be that the DAGScheduler handles the high level semantics of scheduling stages and dealing with inter-stage dependencies, and the TaskSetManager handles the low-level details of the tasks for each stage.  There are some parts of this abstraction that are currently broken (where the DAGScheduler knows too much about task-level details) and refactoring this is on my todo list, but in the meantime I think we should try not to make this problem any worse, because it makes the code much more complicated, more difficult to understand, and buggy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39628971
  
    This patch is a little confusing to me because I always thought the definition and semantics of an accumulator are that its value might be different in the case of failures. Maybe the issue is we don't document this clearly enough. Or maybe it's never been established as such.
    
    The approach taken here changes a property of accumulators - it makes it so the value of an accumulator cannot be accessed until the job is complete. This seems like a loss of functionality. E.g. I could imagine people using accumulators to decide when to terminate a stage (in the future) for things like approximate calculations where you want to stop executing at some point based on an accumulated value.
    
    The approach here also seems to assume that once a job finishes the stages couldn't ever be re-computed later on. Is that valid? For instance, if you have executor failures that cause shuffle output or RDD partitions to be lost, won't the scheduler need to re-compute partitions in a stage any ways? What if the tasks update accumulator values that are necessary for re-creating that data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39229530
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38637095
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-52419524
  
    I will resubmit the PR next week, as DAGScheduler has been changed a lot in the passed months


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38607581
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38997867
  
    Merged build started. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38621313
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13441/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]SPARK-732: eliminate duplicate update of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38986823
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39208568
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38999177
  
     Merged build triggered. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39621115
  
    Jenkins, retest this please
    
    
    On Fri, Apr 4, 2014 at 4:26 PM, Nan Zhu <no...@github.com> wrote:
    
    > can anyone help to retrigger the test? Jenkins failed to fetch remote repo
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/228#issuecomment-39621045>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949381
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -925,6 +939,10 @@ class DAGScheduler(
             }
             failedStages += failedStage
             failedStages += mapStage
    +        if (runningStages.contains(failedStage)) {
    +          stageIdToAccumulators -= failedStage.id
    +        }
    +        runningStages -= failedStage
    --- End diff --
    
    Move everything then ? Or set a flag to see if it contained it or not ?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-38672237
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39620206
  
    rebased


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r11094329
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -789,6 +799,25 @@ class DAGScheduler(
       }
     
       /**
    +   * detect the duplicate accumulator value and save the accumulator values
    +   * @param accumValue the accumulator values received from the task
    +   * @param stage the stage which the task belongs to
    +   * @param task the completed task
    +   */
    +  private def saveAccumulatorValue(accumValue: Map[Long, Any], stage: Stage, task: Task[_]) {
    +    if (accumValue != null &&
    +      (!stageIdToAccumulators.contains(stage.id) ||
    +        !stageIdToAccumulators(stage.id).contains(task.partitionId))) {
    +      stageIdToAccumulators.getOrElseUpdate(stage.id,
    +        new HashMap[Int, ListBuffer[(Long, Any)]]).
    +        getOrElseUpdate(task.partitionId, new ListBuffer[(Long, Any)])
    +      for ((id, value) <- accumValue) {
    +        stageIdToAccumulators(stage.id)(task.partitionId) += id -> value
    --- End diff --
    
    nit: you can avoid the lookup within the loop by using the value returned in the getOrElseUpdate calls.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39002053
  
    Merged build finished. Build is starting -or- tests failed to complete.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/228#discussion_r10949181
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -925,6 +939,10 @@ class DAGScheduler(
             }
             failedStages += failedStage
             failedStages += mapStage
    +        if (runningStages.contains(failedStage)) {
    +          stageIdToAccumulators -= failedStage.id
    +        }
    +        runningStages -= failedStage
    --- End diff --
    
    Any reason to reorder execution flow ? I am not sure what the impact of this is top of my head, hence the query.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-732: eliminate duplicate update of the a...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39007883
  
    Cool thanks!!
    
    
    On Sat, Mar 29, 2014 at 1:12 PM, Nan Zhu <no...@github.com> wrote:
    
    > Hi, Kay, I will think about it, and see if we can move accumulator related
    > functionalities to tm entirely.
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/228#issuecomment-39007448>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---