You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sitalkedia <gi...@git.apache.org> on 2017/05/30 22:26:22 UTC

[GitHub] spark pull request #18150: Cleanup shuffle

GitHub user sitalkedia opened a pull request:

    https://github.com/apache/spark/pull/18150

    Cleanup shuffle

    ## What changes were proposed in this pull request?
    
    Currently, when we detect fetch failure, we only remove the shuffle files produced by the executor, while the host itself might be down and all the shuffle files are not accessible. In case we are running multiple executors on a host, any host going down currently results in multiple fetch failures and multiple retries of the stage, which is very inefficient. If we remove all the shuffle files on that host, on first fetch failure, we can rerun all the tasks on that host in a single stage retry.
    
    ## How was this patch tested?
    
    Unit testing and also ran a job on the cluster and made sure multiple retries are gone.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark cleanup_shuffle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18150
    
----
commit 74ca88bc1d2b67cc12ea32a3cd344ec0259500a9
Author: Sital Kedia <sk...@fb.com>
Date:   2017-02-25T00:35:00Z

    [SPARK-19753][CORE] All shuffle files on a host should be removed in case of fetch failure or slave lost

commit 6898c2bb0f8a65dcc488e53b248fbeaec64efdb8
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-01T02:03:55Z

    Do not un-register shuffle files in case of executor lost

commit 32a2315caa07a5a6be1bd92ec1e13500b74308cb
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-01T02:13:07Z

    no-op when external shuffle service is disabled

commit c7c3129dcc4ad2fc1a75bff5a941f6c4a8dfd0ef
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-01T02:23:59Z

    fix check style

commit f96ec68d6922fe2108c5869fedf2d8aca373c6eb
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-16T22:24:43Z

    Addressed review comments and fixed a bug

commit d4979e35137152db00c53ea0b9e82aaf41dad5b5
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-16T23:00:17Z

    Fix build

commit 8787db1679c5b468afa3d2ede64eee53908fa5de
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-17T02:37:44Z

    Fix test failures

commit 4ca9527a8cf78ba1c3e64c81ee6afc9e93b05fe6
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-17T15:53:25Z

    refactoring & comments

commit 9f64e2931eabd2fcc5909123e73c9c046caceb3b
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-18T04:03:37Z

    Review comments

commit be3b3dbd2d813a3d1d164d9b7f8127d09b752880
Author: Sital Kedia <sk...@fb.com>
Date:   2017-03-24T22:39:05Z

    Minor changes as per review comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #78015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78015/testReport)** for PR 18150 at commit [`74f285a`](https://github.com/apache/spark/commit/74f285ac1faa7a67d9943a14c49df06097755ff0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121209197
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -190,6 +190,12 @@ class DAGScheduler(
       /**
        * Number of consecutive stage attempts allowed before a stage is aborted.
        */
    +  private[scheduler] val unRegisterOutputOnHostOnFetchFailure =
    +    sc.getConf.getBoolean("spark.fetch.failure.unRegister.output.on.host", true)
    +
    +  /**
    --- End diff --
    
    Same as above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77559 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77559/testReport)** for PR 18150 at commit [`500938d`](https://github.com/apache/spark/commit/500938d698cfb0e502f2b35f4657424be8f70450).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121024072
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1383,19 +1394,43 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    +      maybeEpoch = None)
    +  }
    +
    +  private def removeExecutorAndUnregisterOutputs(
    +      execId: String,
    +      fileLost: Boolean,
    +      hostToUnregisterOutputs: Option[String],
    +      maybeEpoch: Option[Long] = None): Unit = {
         val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
         if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
           failedEpoch(execId) = currentEpoch
           logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
           blockManagerMaster.removeExecutor(execId)
    -
    -      if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
    -        logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +      if (fileLost) {
    +        hostToUnregisterOutputs match {
    +          case Some(host) =>
    +            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
    +          case None =>
    +            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +        }
             // TODO: This will be really slow if we keep accumulating shuffle map stages
             for ((shuffleId, stage) <- shuffleIdToMapStage) {
    -          stage.removeOutputsOnExecutor(execId)
    +          hostToUnregisterOutputs match {
    --- End diff --
    
    @JoshRosen - I think this is a good idea to have this behavior configurable so that users have a fall back option. @tgravescs  - What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121021307
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1383,19 +1394,43 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    +      maybeEpoch = None)
    +  }
    +
    +  private def removeExecutorAndUnregisterOutputs(
    +      execId: String,
    +      fileLost: Boolean,
    +      hostToUnregisterOutputs: Option[String],
    +      maybeEpoch: Option[Long] = None): Unit = {
         val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
         if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
           failedEpoch(execId) = currentEpoch
           logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
           blockManagerMaster.removeExecutor(execId)
    -
    -      if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
    -        logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +      if (fileLost) {
    +        hostToUnregisterOutputs match {
    +          case Some(host) =>
    +            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
    +          case None =>
    +            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +        }
             // TODO: This will be really slow if we keep accumulating shuffle map stages
             for ((shuffleId, stage) <- shuffleIdToMapStage) {
    -          stage.removeOutputsOnExecutor(execId)
    +          hostToUnregisterOutputs match {
    --- End diff --
    
    Ah, I remember now: the problem is that the current code already avoids the problem with the "1 -> 2 -> 3" cascading retries issue because it's _already_ treating any fetch failure as a complete executor output loss via
    
    ```
    handleExecutorLost(bmAddress.executorId, filesLost = true, Some(task.epoch))
    ```
    
    Over on the earlier PR, there were a number of arguments that "fetch failure does not imply lost executor", but those are kind of moot given that we're _already_ handling it that way in today's code and before the changes proposed here.
    
    So now I see why the PR description focuses on the multiple executors per host scenario: this isn't actually changing behavior in the single-executor-per-host world.
    
    In the case where the host is legitimately down then _most likely_ we'll get fetch failures from all of the executors on that host and will remove all of the outputs, so in some scenarios we'll happen to do the right thing. But I can imagine how there are scenarios on very large clusters where we'll get unlucky and won't observe fetch failures from the complete set of executors being served from that host's shuffle service; this could be especially likely to happen if the shuffle service is serving outputs from many dead executors which OOM'd and restarted during the map phase. Therefore I can understand the argument here that it's best to just round up.
    
    The only real question here is how often we'd regret doing this in practice: how often is a fetch failure actually a transient issue? Given that there are already fetch retry mechanisms inside of tasks, I'm guessing that the "false positive" scenario is somewhat rare.
    
    Therefore, I'm now persuaded that this is a good change and I have a better understanding of how this fits into the larger context of fetch failure handling issues.
    
    Since this has been somewhat controversial, what do you think about compromising and adding a feature-flag which lets users opt-in to the old behavior (i.e. the flag just disables the promotion of "all outputs on executor lost" to "all outputs on host lost")?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121569579
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -396,6 +396,69 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
         assertDataStructuresEmpty()
       }
     
    +  test("All shuffle files should on the slave should be cleaned up when slave lost") {
    +    // reset the test context with the right shuffle service config
    +    afterEach()
    +    val conf = new SparkConf()
    +    conf.set("spark.shuffle.service.enabled", "true")
    +    conf.set("spark.files.fetchFailure.unRegisterOutputOnHost", "true")
    +    init(conf)
    +    runEvent(ExecutorAdded("exec-hostA1", "hostA"))
    +    runEvent(ExecutorAdded("exec-hostA2", "hostA"))
    +    runEvent(ExecutorAdded("exec-hostB", "hostB"))
    +    val firstRDD = new MyRDD(sc, 3, Nil)
    +    val firstShuffleDep = new ShuffleDependency(firstRDD, new HashPartitioner(3))
    +    val firstShuffleId = firstShuffleDep.shuffleId
    +    val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep))
    +    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
    +    val reduceRdd = new MyRDD(sc, 1, List(shuffleDep))
    +    submit(reduceRdd, Array(0))
    +    // map stage1 completes successfully, with one task on each executor
    +    complete(taskSets(0), Seq(
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA1", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA2", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success, makeMapStatus("hostB", 1))
    +    ))
    +    // map stage2 completes successfully, with one task on each executor
    +    complete(taskSets(1), Seq(
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA1", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA2", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success, makeMapStatus("hostB", 1))
    +    ))
    +    // make sure our test setup is correct
    +    val initialMapStatus1 = mapOutputTracker.mapStatuses.get(0).get
    +    assert(initialMapStatus1.count(_ != null) === 3)
    +    assert(initialMapStatus1.map{_.location.executorId}.toSet ===
    +      Set("exec-hostA1", "exec-hostA2", "exec-hostB"))
    +
    +    val initialMapStatus2 = mapOutputTracker.mapStatuses.get(0).get
    --- End diff --
    
    `mapOutputTracker.mapStatuses.get(1)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Please note that the old PR - https://github.com/apache/spark/pull/17088 was closed inadventantly as a stale PR. Refer to the old PR for more discussion. 
    
    cc - @tgravescs 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77557/testReport)** for PR 18150 at commit [`be3b3db`](https://github.com/apache/spark/commit/be3b3dbd2d813a3d1d164d9b7f8127d09b752880).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121827490
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -396,6 +396,69 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
         assertDataStructuresEmpty()
       }
     
    +  test("All shuffle files should on the slave should be cleaned up when slave lost") {
    --- End diff --
    
    fixed, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77853/testReport)** for PR 18150 at commit [`5bd4bc3`](https://github.com/apache/spark/commit/5bd4bc3554a8238d673358143624931b6e0d2115).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121827512
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -396,6 +396,69 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
         assertDataStructuresEmpty()
       }
     
    +  test("All shuffle files should on the slave should be cleaned up when slave lost") {
    +    // reset the test context with the right shuffle service config
    +    afterEach()
    +    val conf = new SparkConf()
    +    conf.set("spark.shuffle.service.enabled", "true")
    +    conf.set("spark.files.fetchFailure.unRegisterOutputOnHost", "true")
    +    init(conf)
    +    runEvent(ExecutorAdded("exec-hostA1", "hostA"))
    +    runEvent(ExecutorAdded("exec-hostA2", "hostA"))
    +    runEvent(ExecutorAdded("exec-hostB", "hostB"))
    +    val firstRDD = new MyRDD(sc, 3, Nil)
    +    val firstShuffleDep = new ShuffleDependency(firstRDD, new HashPartitioner(3))
    +    val firstShuffleId = firstShuffleDep.shuffleId
    +    val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep))
    +    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
    +    val reduceRdd = new MyRDD(sc, 1, List(shuffleDep))
    +    submit(reduceRdd, Array(0))
    +    // map stage1 completes successfully, with one task on each executor
    +    complete(taskSets(0), Seq(
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA1", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA2", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success, makeMapStatus("hostB", 1))
    +    ))
    +    // map stage2 completes successfully, with one task on each executor
    +    complete(taskSets(1), Seq(
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA1", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success,
    +        MapStatus(BlockManagerId("exec-hostA2", "hostA", 12345), Array.fill[Long](1)(2))),
    +      (Success, makeMapStatus("hostB", 1))
    +    ))
    +    // make sure our test setup is correct
    +    val initialMapStatus1 = mapOutputTracker.mapStatuses.get(0).get
    +    assert(initialMapStatus1.count(_ != null) === 3)
    +    assert(initialMapStatus1.map{_.location.executorId}.toSet ===
    +      Set("exec-hostA1", "exec-hostA2", "exec-hostB"))
    +
    +    val initialMapStatus2 = mapOutputTracker.mapStatuses.get(0).get
    --- End diff --
    
    good eye, fixed. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    @jiangxb1987 - Sure, made a change to turn off the behavior by default. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Thanks for the review @JoshRosen , @jiangxb1987. Added a config to turn of the behavior and addressed minor comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121204287
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala ---
    @@ -133,6 +133,28 @@ private[spark] class ShuffleMapStage(
       }
     
       /**
    +   * Removes all shuffle outputs associated with this host. Note that this will also remove
    +   * outputs which are served by an external shuffle server (if one exists), as they are still
    +   * registered with this execId.
    +   */
    +  def removeOutputsOnHost(host: String): Unit = {
    --- End diff --
    
    Good suggestion, done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121210130
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala ---
    @@ -138,17 +138,7 @@ private[spark] class ShuffleMapStage(
        * registered with this execId.
    --- End diff --
    
    We should also update the comment here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #78016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78016/testReport)** for PR 18150 at commit [`ba2ca2a`](https://github.com/apache/spark/commit/ba2ca2adfe15977fe48ccc97b282e97381135f67).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    ping @JoshRosen @tgravescs @squito


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121844707
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1370,22 +1393,42 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    --- End diff --
    
    one more question: if worker lost, shouldn't we unregister outputs on that worker/host?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77859/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121569075
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -396,6 +396,69 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
         assertDataStructuresEmpty()
       }
     
    +  test("All shuffle files should on the slave should be cleaned up when slave lost") {
    --- End diff --
    
    typo: `All shuffle files should on the slave` -> `All shuffle files on the slave`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77853/testReport)** for PR 18150 at commit [`5bd4bc3`](https://github.com/apache/spark/commit/5bd4bc3554a8238d673358143624931b6e0d2115).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121846539
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1370,22 +1393,42 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    --- End diff --
    
    seems we can't get worker id here, nvm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78016/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #78015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78015/testReport)** for PR 18150 at commit [`74f285a`](https://github.com/apache/spark/commit/74f285ac1faa7a67d9943a14c49df06097755ff0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121089252
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala ---
    @@ -133,6 +133,28 @@ private[spark] class ShuffleMapStage(
       }
     
       /**
    +   * Removes all shuffle outputs associated with this host. Note that this will also remove
    +   * outputs which are served by an external shuffle server (if one exists), as they are still
    +   * registered with this execId.
    +   */
    +  def removeOutputsOnHost(host: String): Unit = {
    --- End diff --
    
    This function is pretty overlap in code with `removeOutputsOnExecutor`, How about combine them as:
    ```
      def removeOutputsByFilter(f: (BlockManagerId) => Boolean): Unit = {
        ......
          val newList = prevList.filterNot(m => f(m))
        ......
      }
    ```
    And in DAGScheduler, we can simply pass in the filter functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77859/testReport)** for PR 18150 at commit [`ec89ac1`](https://github.com/apache/spark/commit/ec89ac1d53beb5a598f12d3a0bd4c6774cdd1976).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77557/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121209859
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala ---
    @@ -160,19 +150,28 @@ private[spark] class ShuffleMapStage(
        * registered with this execId.
        */
       def removeOutputsOnExecutor(execId: String): Unit = {
    +    if (removeOutputsByFilter(x => x.executorId == execId)) {
    +      logInfo("%s is now unavailable on executor %s (%d/%d, %s)".format(
    +        this, execId, _numAvailableOutputs, numPartitions, isAvailable))
    +    }
    +  }
    +
    +  /**
    +   * Removes all shuffle outputs which satisfies the filter. Note that this will also
    +   * remove outputs which are served by an external shuffle server (if one exists),
    +   * as they are still registered with this execId.
    --- End diff --
    
    We don't mention execId here, so we should update the comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77864/testReport)** for PR 18150 at commit [`78bce79`](https://github.com/apache/spark/commit/78bce792c5b057120ba1e4e02ed00cd0c3487629).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121214040
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -190,6 +190,12 @@ class DAGScheduler(
       /**
        * Number of consecutive stage attempts allowed before a stage is aborted.
        */
    +  private[scheduler] val unRegisterOutputOnHostOnFetchFailure =
    +    sc.getConf.getBoolean("spark.fetch.failure.unRegister.output.on.host", true)
    --- End diff --
    
    How about rename it to `spark.files.fetchFailure.unRegisterOutputOnHost` and make it default to false?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121227176
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -562,7 +568,6 @@ class DAGScheduler(
        *
        * @return a JobWaiter object that can be used to block until the job finishes executing
        *         or can be used to cancel the job.
    -   *
    --- End diff --
    
    nit: Let's keep this empty line too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77864/testReport)** for PR 18150 at commit [`78bce79`](https://github.com/apache/spark/commit/78bce792c5b057120ba1e4e02ed00cd0c3487629).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121567987
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -188,6 +188,14 @@ class DAGScheduler(
       private val disallowStageRetryForTest = sc.getConf.getBoolean("spark.test.noStageRetry", false)
     
       /**
    +   * Whether to unregister all the outputs on the host in condition that we receive a FetchFailure,
    +   * this is set default to false, which means, we only unregister the outputs related to the exact
    +   * executor(instead of the host) on a FetchFailure.
    +   */
    +  private[scheduler] val unRegisterOutputOnHostOnFetchFailure =
    +    sc.getConf.getBoolean("spark.files.fetchFailure.unRegisterOutputOnHost", false)
    --- End diff --
    
    can we put this in `org.apache.spark.internal.config`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77859/testReport)** for PR 18150 at commit [`ec89ac1`](https://github.com/apache/spark/commit/ec89ac1d53beb5a598f12d3a0bd4c6774cdd1976).




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Sorry, I've been out on vacation. I think invalidating all outputs and having a feature flag makes sense for now. If we get more data on it causing issues, we can revisit. Sorry, I won't have time to review in detail for a couple of days.




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121083806
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -559,10 +559,8 @@ class DAGScheduler(
        * @param callSite where in the user program this job was called
        * @param resultHandler callback to pass each result to
        * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
    -   *
    --- End diff --
    
    nit: please keep the original format.




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121228041
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -188,6 +188,12 @@ class DAGScheduler(
       private val disallowStageRetryForTest = sc.getConf.getBoolean("spark.test.noStageRetry", false)
     
       /**
    +   * If enabled, fetch failure will cause all the output on that host to be unregistered.
    --- End diff --
    
    Rephrase this by
    ```
    Whether to unregister all the outputs on the host in condition that we receive a FetchFailure, this is set default to false, which means, we only unregister the outputs related to the exact executor(instead of the host) on a FetchFailure.
    ```




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r119927764
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1383,19 +1394,43 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    +      maybeEpoch = None)
    +  }
    +
    +  private def removeExecutorAndUnregisterOutputs(
    +      execId: String,
    +      fileLost: Boolean,
    +      hostToUnregisterOutputs: Option[String],
    +      maybeEpoch: Option[Long] = None): Unit = {
         val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
         if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
           failedEpoch(execId) = currentEpoch
           logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
           blockManagerMaster.removeExecutor(execId)
    -
    -      if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
    -        logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +      if (fileLost) {
    +        hostToUnregisterOutputs match {
    +          case Some(host) =>
    +            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
    +          case None =>
    +            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +        }
             // TODO: This will be really slow if we keep accumulating shuffle map stages
             for ((shuffleId, stage) <- shuffleIdToMapStage) {
    -          stage.removeOutputsOnExecutor(execId)
    +          hostToUnregisterOutputs match {
    --- End diff --
    
    I don't think that would be the ideal behavior in case of fetch failure. Consider a DAG that looks like 1 -> 2 -> 3: if we see any fetch failure while running stage 3, the current behavior invalidates the files in stage 2 as well as stage 1. That way we make sure we rerun the tasks in stage 1 first, then stage 2, and finally stage 3.
    
    If we do not invalidate the files in stage 1 (as per @Josh's suggestion), then we would unnecessarily rerun the tasks in stage 2 only to encounter another fetch failure and realize that the files for stage 1 are also missing. This behavior would introduce significant latency overhead.
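
To make the 1 -> 2 -> 3 example above concrete, a job shape like the following yields exactly that three-stage chain (a rough sketch; the paths and field handling are purely illustrative):

```
// Two shuffle boundaries => three stages:
//   stage 1: read + map          (writes shuffle output for reduceByKey)
//   stage 2: reduceByKey + map   (writes shuffle output for sortByKey)
//   stage 3: sortByKey + save
val counts = sc.textFile("hdfs:///tmp/input")             // stage 1
  .map(line => (line.split(" ")(0), 1L))
  .reduceByKey(_ + _)                                      // shuffle -> stage 2
  .map { case (word, count) => (count, word) }
  .sortByKey(ascending = false)                            // shuffle -> stage 3
counts.saveAsTextFile("hdfs:///tmp/output")
```

If a host is lost while stage 3 is fetching, the stage-1 outputs stored on that host are gone along with the stage-2 outputs, so invalidating only stage 2 merely defers discovering the missing stage-1 files to another fetch failure and another stage retry.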




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121217764
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -190,6 +190,12 @@ class DAGScheduler(
       /**
        * Number of consecutive stage attempts allowed before a stage is aborted.
        */
    +  private[scheduler] val unRegisterOutputOnHostOnFetchFailure =
    +    sc.getConf.getBoolean("spark.fetch.failure.unRegister.output.on.host", true)
    --- End diff --
    
    @jiangxb1987 - Do we want to set the default to false? If we believe this is the correct and expected behavior, we should set it to true, and if we see issues we can turn it off.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77856/
    Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77557/testReport)** for PR 18150 at commit [`be3b3db`](https://github.com/apache/spark/commit/be3b3dbd2d813a3d1d164d9b7f8127d09b752880).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121018254
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1383,19 +1394,43 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    +      maybeEpoch = None)
    +  }
    +
    +  private def removeExecutorAndUnregisterOutputs(
    +      execId: String,
    +      fileLost: Boolean,
    +      hostToUnregisterOutputs: Option[String],
    +      maybeEpoch: Option[Long] = None): Unit = {
         val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
         if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
           failedEpoch(execId) = currentEpoch
           logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
           blockManagerMaster.removeExecutor(execId)
    -
    -      if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
    -        logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +      if (fileLost) {
    +        hostToUnregisterOutputs match {
    +          case Some(host) =>
    +            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
    +          case None =>
    +            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +        }
             // TODO: This will be really slow if we keep accumulating shuffle map stages
             for ((shuffleId, stage) <- shuffleIdToMapStage) {
    -          stage.removeOutputsOnExecutor(execId)
    +          hostToUnregisterOutputs match {
    --- End diff --
    
    @sitalkedia, that's a good point: the real gains here don't necessarily come from the "multiple executors on a host" scenario; instead, it seems like the key benefit is avoiding a bunch of followup stage failures to discover the root of what needs to be recomputed. Your example in this comment is one of the most clear problem statements of this that I've seen so far.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    LGTM, waiting for the rebase




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r119736751
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1383,19 +1394,43 @@ class DAGScheduler(
        */
       private[scheduler] def handleExecutorLost(
           execId: String,
    -      filesLost: Boolean,
    -      maybeEpoch: Option[Long] = None) {
    +      workerLost: Boolean): Unit = {
    +    // if the cluster manager explicitly tells us that the entire worker was lost, then
    +    // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
    +    // from a Standalone cluster, where the shuffle service lives in the Worker.)
    +    val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
    +    removeExecutorAndUnregisterOutputs(
    +      execId = execId,
    +      fileLost = fileLost,
    +      hostToUnregisterOutputs = None,
    +      maybeEpoch = None)
    +  }
    +
    +  private def removeExecutorAndUnregisterOutputs(
    +      execId: String,
    +      fileLost: Boolean,
    +      hostToUnregisterOutputs: Option[String],
    +      maybeEpoch: Option[Long] = None): Unit = {
         val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
         if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
           failedEpoch(execId) = currentEpoch
           logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
           blockManagerMaster.removeExecutor(execId)
    -
    -      if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
    -        logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +      if (fileLost) {
    +        hostToUnregisterOutputs match {
    +          case Some(host) =>
    +            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
    +          case None =>
    +            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
    +        }
             // TODO: This will be really slow if we keep accumulating shuffle map stages
             for ((shuffleId, stage) <- shuffleIdToMapStage) {
    -          stage.removeOutputsOnExecutor(execId)
    +          hostToUnregisterOutputs match {
    --- End diff --
    
    So the thing I believe @joshrosen was talking about in the JIRA was: instead of invalidating all stages, just invalidate the outputs for that map stage on that host.
    
    So something like: for (all execs on failed host) stage.removeOutputsOnExecutor(exec)
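
For comparison, a rough sketch of that narrower alternative; `failedShuffleId`, `failedHost`, and `execsOnHost` are hypothetical placeholders, while `shuffleIdToMapStage` and `removeOutputsOnExecutor` are the existing members visible in the diff above:

```
// Sketch of the narrower option: only the map stage whose output fetch failed
// is touched, and only for the executors that ran on the lost host, instead of
// unregistering that host's outputs for every registered shuffle.
val mapStage = shuffleIdToMapStage(failedShuffleId)
for (exec <- execsOnHost(failedHost)) {
  mapStage.removeOutputsOnExecutor(exec)
}
```

sitalkedia's 1 -> 2 -> 3 example elsewhere in the thread describes the cost of this option: missing outputs of earlier stages on the same host would only be discovered through further fetch failures and extra stage retries.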




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77559 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77559/testReport)** for PR 18150 at commit [`500938d`](https://github.com/apache/spark/commit/500938d698cfb0e502f2b35f4657424be8f70450).




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77856/testReport)** for PR 18150 at commit [`810a101`](https://github.com/apache/spark/commit/810a10131200f46902a7563422558c63967f5c09).




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18150




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77853/
    Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #78016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78016/testReport)** for PR 18150 at commit [`ba2ca2a`](https://github.com/apache/spark/commit/ba2ca2adfe15977fe48ccc97b282e97381135f67).




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77864/
    Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77559/
    Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78015/
    Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    **[Test build #77856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77856/testReport)** for PR 18150 at commit [`810a101`](https://github.com/apache/spark/commit/810a10131200f46902a7563422558c63967f5c09).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121827539
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -188,6 +188,14 @@ class DAGScheduler(
       private val disallowStageRetryForTest = sc.getConf.getBoolean("spark.test.noStageRetry", false)
     
       /**
    +   * Whether to unregister all the outputs on the host in condition that we receive a FetchFailure,
    +   * this is set default to false, which means, we only unregister the outputs related to the exact
    +   * executor(instead of the host) on a FetchFailure.
    +   */
    +  private[scheduler] val unRegisterOutputOnHostOnFetchFailure =
    +    sc.getConf.getBoolean("spark.files.fetchFailure.unRegisterOutputOnHost", false)
    --- End diff --
    
    done.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Build finished. Test PASSed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18150: [SPARK-19753][CORE] Un-register all shuffle output on a ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/18150
  
    @sitalkedia Could you rebase this with the latest master? Thanks!




[GitHub] spark pull request #18150: [SPARK-19753][CORE] Un-register all shuffle outpu...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18150#discussion_r121209109
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -190,6 +190,12 @@ class DAGScheduler(
       /**
        * Number of consecutive stage attempts allowed before a stage is aborted.
        */
    +  private[scheduler] val unRegisterOutputOnHostOnFetchFailure =
    --- End diff --
    
    Please place the correct comment above the value.

