Posted to reviews@spark.apache.org by rxin <gi...@git.apache.org> on 2014/08/26 07:39:30 UTC

[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/2127

    [SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI)

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-3224

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2127
    
----
commit 1dd3eb5b849bb250f4251730c1afa722757b706b
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-26T05:38:42Z

    [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53527001
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19288/consoleFull) for   PR 2127 at commit [`effb1ce`](https://github.com/apache/spark/commit/effb1ce35857d04b03485dc37ff8064e835d4911).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16751008
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,31 +1045,38 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
             val mapStage = shuffleToMapStage(shuffleId)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    +        // multiple tasks running concurrently on different executors). In that case, it is possible
    +        // the fetch failure has already been handled by the executor.
    +        if (runningStages.contains(failedStage)) {
    +          markStageAsFinished(failedStage, Some("Fetch failure"))
    +          runningStages -= failedStage
    +          // TODO: Cancel running tasks in the stage
    +          logInfo("Marking " + failedStage + " (" + failedStage.name +
    +            ") for resubmision due to a fetch failure")
    +
    +          logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name +
    +            "); marking it for resubmission")
    +          if (failedStages.isEmpty && eventProcessActor != null) {
    +            // Don't schedule an event to resubmit failed stages if failed isn't empty, because
    +            // in that case the event will already have been scheduled. eventProcessActor may be
    +            // null during unit tests.
    +            import env.actorSystem.dispatcher
    +            env.actorSystem.scheduler.scheduleOnce(
    +              RESUBMIT_TIMEOUT, eventProcessActor, ResubmitFailedStages)
    +          }
    +          failedStages += failedStage
    +          failedStages += mapStage
    +        }
    +
    +        // Mark the map whose fetch failed as broken in the map stage
             if (mapId != -1) {
               mapStage.removeOutputLoc(mapId, bmAddress)
               mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    --- End diff --
    
    I'm still not sure this is right. If you receive a new failure event for a new map output, don't you still need to schedule a ResubmitFailedStages event? Otherwise the new map stage might not include all of the failed map tasks.
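
    For readers skimming the diff above, a runnable toy model (plain Scala with illustrative names, not the DAGScheduler itself) of the split being debated: stage-level bookkeeping happens only for the first FetchFailed of a running stage, while the per-event map-output invalidation has to run every time, because each event may name a different dead map output.
    ```scala
    import scala.collection.mutable

    val runningStages  = mutable.Set("reduceStage")
    val failedStages   = mutable.Set[String]()
    val deadMapOutputs = mutable.Set[Int]()

    def onFetchFailed(stage: String, mapId: Int): Unit = {
      // remove() returns true only for the first event seen for this stage attempt
      if (runningStages.remove(stage)) {
        failedStages += stage            // stage-level bookkeeping runs once
      }
      if (mapId != -1) {
        deadMapOutputs += mapId          // event-level bookkeeping runs for every event
      }
    }

    onFetchFailed("reduceStage", mapId = 3)
    onFetchFailed("reduceStage", mapId = 7)  // duplicate stage failure, new dead map output
    assert(failedStages == Set("reduceStage"))
    assert(deadMapOutputs == Set(3, 7))
    ```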




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53458190
  
    Yeah - this makes sense. Just to be clear though, this doesn't change the existing logic except to surround it with an if-else, correct?




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16696798
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,31 +1045,37 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
    -        val mapStage = shuffleToMapStage(shuffleId)
    -        if (mapId != -1) {
    -          mapStage.removeOutputLoc(mapId, bmAddress)
    -          mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    -        }
    -        logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name +
    -          "); marking it for resubmission")
    -        if (failedStages.isEmpty && eventProcessActor != null) {
    -          // Don't schedule an event to resubmit failed stages if failed isn't empty, because
    -          // in that case the event will already have been scheduled. eventProcessActor may be
    -          // null during unit tests.
    -          import env.actorSystem.dispatcher
    -          env.actorSystem.scheduler.scheduleOnce(
    -            RESUBMIT_TIMEOUT, eventProcessActor, ResubmitFailedStages)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    --- End diff --
    
    see the comment here explaining the problem




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53459271
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19219/consoleFull) for   PR 2127 at commit [`1dd3eb5`](https://github.com/apache/spark/commit/1dd3eb5b849bb250f4251730c1afa722757b706b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16732419
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,31 +1045,37 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
    -        val mapStage = shuffleToMapStage(shuffleId)
    -        if (mapId != -1) {
    -          mapStage.removeOutputLoc(mapId, bmAddress)
    -          mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    -        }
    -        logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name +
    -          "); marking it for resubmission")
    -        if (failedStages.isEmpty && eventProcessActor != null) {
    -          // Don't schedule an event to resubmit failed stages if failed isn't empty, because
    -          // in that case the event will already have been scheduled. eventProcessActor may be
    -          // null during unit tests.
    -          import env.actorSystem.dispatcher
    -          env.actorSystem.scheduler.scheduleOnce(
    -            RESUBMIT_TIMEOUT, eventProcessActor, ResubmitFailedStages)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    +        // multiple tasks running concurrently on different executors). In that case, it is possible
    +        // the fetch failure has already been handled by the executor.
    +        if (runningStages.contains(failedStage)) {
    +          markStageAsFinished(failedStage, Some("Fetch failure"))
    +          runningStages -= failedStage
    +          // TODO: Cancel running tasks in the stage
    +          logInfo("Marking " + failedStage + " (" + failedStage.name +
    +            ") for resubmision due to a fetch failure")
    +
    +          // Mark the map whose fetch failed as broken in the map stage
    +          val mapStage = shuffleToMapStage(shuffleId)
    +          if (mapId != -1) {
    +            mapStage.removeOutputLoc(mapId, bmAddress)
    +            mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    --- End diff --
    
    Don't we want these two lines, even if the stage has already been marked as failed? It seems like the new failure could be telling us about a different dead map output.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53458821
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53529562
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19288/consoleFull) for   PR 2127 at commit [`effb1ce`](https://github.com/apache/spark/commit/effb1ce35857d04b03485dc37ff8064e835d4911).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53502423
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19246/consoleFull) for   PR 2127 at commit [`3d3d356`](https://github.com/apache/spark/commit/3d3d3565f82f3551c2bd7c6eb6013b551179729c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53481805
  
    Jenkins, retest this please
    
    
    On Tue, Aug 26, 2014 at 10:31 AM, Reynold Xin <no...@github.com>
    wrote:
    
    > Jenkins, retest this please.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/2127#issuecomment-53458821>.
    >




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16757359
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,21 +1045,19 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
             val mapStage = shuffleToMapStage(shuffleId)
    -        if (mapId != -1) {
    -          mapStage.removeOutputLoc(mapId, bmAddress)
    -          mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    +        // multiple tasks running concurrently on different executors). In that case, it is possible
    +        // the fetch failure has already been handled by the scheduler.
    +        if (runningStages.contains(failedStage)) {
    +          markStageAsFinished(failedStage, Some("Fetch failure"))
    +          runningStages -= failedStage
    +          // TODO: Cancel running tasks in the stage
    +          logInfo(s"Marking $failedStage (${failedStage.name}) for resubmision " +
    --- End diff --
    
    Shouldn't this be below? Other than this tiny thing, this LGTM.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16757427
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,21 +1045,19 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
             val mapStage = shuffleToMapStage(shuffleId)
    -        if (mapId != -1) {
    -          mapStage.removeOutputLoc(mapId, bmAddress)
    -          mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    +        // multiple tasks running concurrently on different executors). In that case, it is possible
    +        // the fetch failure has already been handled by the scheduler.
    +        if (runningStages.contains(failedStage)) {
    +          markStageAsFinished(failedStage, Some("Fetch failure"))
    +          runningStages -= failedStage
    +          // TODO: Cancel running tasks in the stage
    +          logInfo(s"Marking $failedStage (${failedStage.name}) for resubmision " +
    --- End diff --
    
    I'll tune up these log messages and merge this - thanks.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53522050
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19271/consoleFull) for   PR 2127 at commit [`49282b3`](https://github.com/apache/spark/commit/49282b3575df62717f4795b66925fb5099079d63).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `"$FWDIR"/bin/spark-submit --class $CLASS "$`
      * `class ExternalSorter(object):`
      * `"$FWDIR"/bin/spark-submit --class $CLASS "$`
      * `protected class AttributeEquals(val a: Attribute) `





[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16752571
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,31 +1045,38 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
             val mapStage = shuffleToMapStage(shuffleId)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    +        // multiple tasks running concurrently on different executors). In that case, it is possible
    +        // the fetch failure has already been handled by the executor.
    +        if (runningStages.contains(failedStage)) {
    +          markStageAsFinished(failedStage, Some("Fetch failure"))
    +          runningStages -= failedStage
    +          // TODO: Cancel running tasks in the stage
    +          logInfo("Marking " + failedStage + " (" + failedStage.name +
    +            ") for resubmision due to a fetch failure")
    +
    +          logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name +
    +            "); marking it for resubmission")
    +          if (failedStages.isEmpty && eventProcessActor != null) {
    +            // Don't schedule an event to resubmit failed stages if failed isn't empty, because
    +            // in that case the event will already have been scheduled. eventProcessActor may be
    +            // null during unit tests.
    +            import env.actorSystem.dispatcher
    +            env.actorSystem.scheduler.scheduleOnce(
    +              RESUBMIT_TIMEOUT, eventProcessActor, ResubmitFailedStages)
    +          }
    +          failedStages += failedStage
    +          failedStages += mapStage
    +        }
    +
    +        // Mark the map whose fetch failed as broken in the map stage
             if (mapId != -1) {
               mapStage.removeOutputLoc(mapId, bmAddress)
               mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    --- End diff --
    
    Is that a problem? I think the reduce stage retry will fail, leading to resubmission anyway?




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53518853
  
    @kayousterhout I looked into the race condition we discussed offline. I think even in that case the scheduler is resilient.
    
    If you look at `newOrUsedStage`, it gets the map output locs from MapOutputTracker this way:
    ```scala
          val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
          val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
          for (i <- 0 until locs.size) {
            stage.outputLocs(i) = Option(locs(i)).toList   // locs(i) will be null if missing
          }
    ```
    
    The Option there guards against null locations.
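
    For anyone unfamiliar with that idiom, a quick plain-Scala illustration (not Spark code): a null (missing) status collapses to an empty location list rather than a null entry.
    ```scala
    // Option(null) is None, so a missing map output yields an empty list of locations.
    val missing: String = null
    assert(Option(missing).toList == Nil)
    assert(Option("hostA:7337").toList == List("hostA:7337"))
    ```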




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53381095
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19198/consoleFull) for   PR 2127 at commit [`1dd3eb5`](https://github.com/apache/spark/commit/1dd3eb5b849bb250f4251730c1afa722757b706b).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53467728
  
    Also, can you write a unit test for this? I think it should be pretty easy -- you can just check for doubly-received StageCompleted events. Our lack of unit tests in the scheduler code has historically led to many bugs in new patches...
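
    A rough sketch of the kind of check being suggested (not the test that was eventually added; it only shows the counting side via the public SparkListener API, and leaves out how a FetchFailed would be provoked deterministically):
    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
    import scala.collection.mutable

    // Counts StageCompleted events per stage id so a test can assert that a stage
    // hit by duplicate FetchFailed events is not marked finished twice.
    class StageCompletionCounter extends SparkListener {
      val completions = mutable.Map[Int, Int]().withDefaultValue(0)
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        completions(stageCompleted.stageInfo.stageId) += 1
      }
    }

    // In a suite: sc.addSparkListener(new StageCompletionCounter), run a job that hits a
    // fetch failure, then inspect `completions` (allowing for legitimate re-attempts).
    ```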




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53377909
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19198/consoleFull) for   PR 2127 at commit [`1dd3eb5`](https://github.com/apache/spark/commit/1dd3eb5b849bb250f4251730c1afa722757b706b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53521422
  
    I think that's fine since that just fails the executor code, which will result in another retry?




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53458835
  
    Yea it doesn't.





[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53507395
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19246/consoleFull) for   PR 2127 at commit [`3d3d356`](https://github.com/apache/spark/commit/3d3d3565f82f3551c2bd7c6eb6013b551179729c).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53519125
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19271/consoleFull) for   PR 2127 at commit [`49282b3`](https://github.com/apache/spark/commit/49282b3575df62717f4795b66925fb5099079d63).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2127#discussion_r16750961
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1045,31 +1045,38 @@ class DAGScheduler(
             stage.pendingTasks += task
     
           case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
    -        // Mark the stage that the reducer was in as unrunnable
             val failedStage = stageIdToStage(task.stageId)
    -        markStageAsFinished(failedStage, Some("Fetch failure"))
    -        runningStages -= failedStage
    -        // TODO: Cancel running tasks in the stage
    -        logInfo("Marking " + failedStage + " (" + failedStage.name +
    -          ") for resubmision due to a fetch failure")
    -        // Mark the map whose fetch failed as broken in the map stage
             val mapStage = shuffleToMapStage(shuffleId)
    +        // It is likely that we receive multiple FetchFailed for a single stage (because we have
    +        // multiple tasks running concurrently on different executors). In that case, it is possible
    +        // the fetch failure has already been handled by the executor.
    +        if (runningStages.contains(failedStage)) {
    +          markStageAsFinished(failedStage, Some("Fetch failure"))
    +          runningStages -= failedStage
    +          // TODO: Cancel running tasks in the stage
    +          logInfo("Marking " + failedStage + " (" + failedStage.name +
    +            ") for resubmision due to a fetch failure")
    +
    +          logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name +
    +            "); marking it for resubmission")
    --- End diff --
    
    Can you combine this with the above now-very-redundant log msg?




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53502011
  
    Pushed a new version. I agree we should add a test for it, but that shouldn't block 1.1.





[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53467488
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19219/consoleFull) for   PR 2127 at commit [`1dd3eb5`](https://github.com/apache/spark/commit/1dd3eb5b849bb250f4251730c1afa722757b706b).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53377536
  
    cc @kayousterhout & @pwendell -- can you take a look?




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53522845
  
    Ah cool, you're right!




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53458206
  
    LGTM




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53520416
  
    That code looks like it's just setting the output locations for the map stage... what about the following case:
    
    (1) map stage runs
    (2) reduce stage starts
    (3) reduce task fails because map output A is lost
    (4) map stage is restarted, with a single task for output A
    (5) scheduler gets another message that a second reduce task failed because output B was missing.
    (6) map stage finishes, and new reduce stage is started
    (7) when the reduce stage tries to get the output locations, won't it get an exception because there's no location for output B?
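
    Following rxin's replies elsewhere in the thread, a toy trace of those steps (plain Scala with made-up host names, not scheduler code) showing why (7) ends in another fetch failure and resubmission rather than an exception: the missing entry becomes an empty location list, not a null.
    ```scala
    import scala.collection.mutable

    // map output id -> known locations
    val outputLocs = mutable.Map("A" -> List("hostA"), "B" -> List("hostB"))
    outputLocs("A") = Nil             // (3) a reduce task reports A as lost
    //                                   (4) map stage restarts with a single task for A
    outputLocs("B") = Nil             // (5) a late FetchFailed reports B as lost too
    outputLocs("A") = List("hostC")   // (6) the rerun of A finishes; reduce stage restarts
    assert(outputLocs("B").isEmpty)   // (7) B has no location: the fetch fails again and
                                      //     triggers another FetchFailed/resubmission
    ```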




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/2127#issuecomment-53481896
  
    Oh oops, these emails arrived in the wrong order -- I thought your request
    for testing had not been satisfied.  Sorry for the duplication!
    
    
    On Tue, Aug 26, 2014 at 1:10 PM, Kay Ousterhout <ka...@gmail.com>
    wrote:
    
    > Jenkins, retest this please
    >
    >
    > On Tue, Aug 26, 2014 at 10:31 AM, Reynold Xin <no...@github.com>
    > wrote:
    >
    >> Jenkins, retest this please.
    >>
    >> —
    >> Reply to this email directly or view it on GitHub
    >> <https://github.com/apache/spark/pull/2127#issuecomment-53458821>.
    >>
    >
    >




[GitHub] spark pull request: [SPARK-3224] FetchFailed reduce stages should ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2127

