You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jerryshao <gi...@git.apache.org> on 2017/03/01 06:03:38 UTC

[GitHub] spark pull request #17113: [SPARK-13669][Core]Improve the blacklist mechanis...

GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/17113

    [SPARK-13669][Core]Improve the blacklist mechanism to handle external shuffle service unavailable situation

    ## What changes were proposed in this pull request?
    
    Currently we are running into an issue with Yarn work preserving enabled + external shuffle service.
    In the work preserving enabled scenario, the failure of NM will not lead to the exit of executors, so executors can still accept and run the tasks. The problem here is when NM is failed, external shuffle service is actually inaccessible, so reduce tasks will always complain about the \u201cFetch failure\u201d, and the failure of reduce stage will make the parent stage (map stage) rerun. The tricky thing here is Spark scheduler is not aware of the unavailability of external shuffle service, and will reschedule the map tasks on the executor where NM is failed, and again reduce stage will be failed with \u201cFetch failure\u201d, and after 4 retries, the job is failed. This could also apply to other cluster manager with external shuffle service.
    
    So here the main problem is that we should avoid assigning tasks to those bad executors (where shuffle service is unavailable). Current Spark's blacklist mechanism could blacklist executors/nodes by failure tasks, but it doesn't handle this specific fetch failure scenario. So here propose to improve the current application blacklist mechanism to handle fetch failure issue (especially with external shuffle service unavailable issue), to blacklist the executors/nodes where shuffle fetch is unavailable. 
    
    ## How was this patch tested?
    
    Unit test and small cluster verification.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-13669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17113
    
----
commit a90b23d894bfd568a054a5752c958c36aa58c79b
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-03-01T05:54:21Z

    Improve the blacklist mechanism to handle external shuffle service unavailable situation
    
    Change-Id: I1c0776ec98866c5294ea4ed5d98793fdcebf44ae

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75152/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][SPARK-20898][Core] Improve the blac...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r124171370
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala ---
    @@ -529,4 +529,59 @@ class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with M
         verify(allocationClientMock).killExecutors(Seq("2"), true, true)
         verify(allocationClientMock).killExecutorsOnHost("hostA")
       }
    +
    +  test("fetch failure blacklisting kills executors, configured by BLACKLIST_KILL_ENABLED") {
    +    val allocationClientMock = mock[ExecutorAllocationClient]
    +    when(allocationClientMock.killExecutors(any(), any(), any())).thenReturn(Seq("called"))
    +    when(allocationClientMock.killExecutorsOnHost("hostA")).thenAnswer(new Answer[Boolean] {
    +      // To avoid a race between blacklisting and killing, it is important that the nodeBlacklist
    +      // is updated before we ask the executor allocation client to kill all the executors
    +      // on a particular host.
    +      override def answer(invocation: InvocationOnMock): Boolean = {
    +        if (blacklist.nodeBlacklist.contains("hostA") == false) {
    +          throw new IllegalStateException("hostA should be on the blacklist")
    +        }
    +        true
    +      }
    +    })
    +
    +    conf.set(config.BLACKLIST_FETCH_FAILURE_ENABLED, true)
    +    blacklist = new BlacklistTracker(listenerBusMock, conf, Some(allocationClientMock), clock)
    +
    +    // Disable auto-kill. Blacklist an executor and make sure killExecutors is not called.
    +    conf.set(config.BLACKLIST_KILL_ENABLED, false)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "1")
    +
    +    verify(allocationClientMock, never).killExecutors(any(), any(), any())
    +    verify(allocationClientMock, never).killExecutorsOnHost(any())
    +
    +    // Enable auto-kill. Blacklist an executor and make sure killExecutors is called.
    +    conf.set(config.BLACKLIST_KILL_ENABLED, true)
    +    blacklist = new BlacklistTracker(listenerBusMock, conf, Some(allocationClientMock), clock)
    +    clock.advance(1000)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "1")
    +
    +    verify(allocationClientMock).killExecutors(Seq("1"), true, true)
    +    verify(allocationClientMock, never).killExecutorsOnHost(any())
    +
    +    assert(blacklist.executorIdToBlacklistStatus.contains("1"))
    +    assert(blacklist.executorIdToBlacklistStatus("1").node === "hostA")
    +    assert(blacklist.executorIdToBlacklistStatus("1").expiryTime ===
    +      1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nextExpiryTime === 1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nodeIdToBlacklistExpiryTime.isEmpty)
    +
    +    // Enable external shuffle service to see if all the executors on this node will be killed.
    +    conf.set(config.SHUFFLE_SERVICE_ENABLED, true)
    +    clock.advance(1000)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "2")
    +
    +    verify(allocationClientMock, never).killExecutors(Seq("2"), true, true)
    +    verify(allocationClientMock).killExecutorsOnHost("hostA")
    +
    +    assert(blacklist.nodeIdToBlacklistExpiryTime.contains("hostA"))
    +    assert(blacklist.nodeIdToBlacklistExpiryTime("hostA") ===
    +      2000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nextExpiryTime === 1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +  }
    --- End diff --
    
    make sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    these failures definitely look unrelated. I'll kick once more to try to get clean run.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Looks like Jenkins is not so stable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Sorry @tgravescs I didn't test executor killing in real cluster. There has bug in it, so I pushed a commit to fix it. Thanks for your reviewing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119152848
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -54,7 +54,7 @@ import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
     private[spark] class TaskSchedulerImpl private[scheduler](
         val sc: SparkContext,
         val maxTaskFailures: Int,
    -    private[scheduler] val blacklistTrackerOpt: Option[BlacklistTracker],
    +    mockBlacklistTracker: Option[BlacklistTracker] = None,
    --- End diff --
    
    whats the point of this refactoring?  I think I see the point -- you are showing that these constructors are only directly called w/ a mock blacklist tracker, but overall I think it actually makes it harder to understand the code, its odd to see something called "mock" in the main code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77575 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77575/testReport)** for PR 17113 at commit [`9a14105`](https://github.com/apache/spark/commit/9a14105f88a7a932a4e831ff2377327500250645).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][SPARK-20898][Core] Improve the blac...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r124165462
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala ---
    @@ -529,4 +529,59 @@ class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with M
         verify(allocationClientMock).killExecutors(Seq("2"), true, true)
         verify(allocationClientMock).killExecutorsOnHost("hostA")
       }
    +
    +  test("fetch failure blacklisting kills executors, configured by BLACKLIST_KILL_ENABLED") {
    +    val allocationClientMock = mock[ExecutorAllocationClient]
    +    when(allocationClientMock.killExecutors(any(), any(), any())).thenReturn(Seq("called"))
    +    when(allocationClientMock.killExecutorsOnHost("hostA")).thenAnswer(new Answer[Boolean] {
    +      // To avoid a race between blacklisting and killing, it is important that the nodeBlacklist
    +      // is updated before we ask the executor allocation client to kill all the executors
    +      // on a particular host.
    +      override def answer(invocation: InvocationOnMock): Boolean = {
    +        if (blacklist.nodeBlacklist.contains("hostA") == false) {
    +          throw new IllegalStateException("hostA should be on the blacklist")
    +        }
    +        true
    +      }
    +    })
    +
    +    conf.set(config.BLACKLIST_FETCH_FAILURE_ENABLED, true)
    +    blacklist = new BlacklistTracker(listenerBusMock, conf, Some(allocationClientMock), clock)
    +
    +    // Disable auto-kill. Blacklist an executor and make sure killExecutors is not called.
    +    conf.set(config.BLACKLIST_KILL_ENABLED, false)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "1")
    +
    +    verify(allocationClientMock, never).killExecutors(any(), any(), any())
    +    verify(allocationClientMock, never).killExecutorsOnHost(any())
    +
    +    // Enable auto-kill. Blacklist an executor and make sure killExecutors is called.
    +    conf.set(config.BLACKLIST_KILL_ENABLED, true)
    +    blacklist = new BlacklistTracker(listenerBusMock, conf, Some(allocationClientMock), clock)
    +    clock.advance(1000)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "1")
    +
    +    verify(allocationClientMock).killExecutors(Seq("1"), true, true)
    +    verify(allocationClientMock, never).killExecutorsOnHost(any())
    +
    +    assert(blacklist.executorIdToBlacklistStatus.contains("1"))
    +    assert(blacklist.executorIdToBlacklistStatus("1").node === "hostA")
    +    assert(blacklist.executorIdToBlacklistStatus("1").expiryTime ===
    +      1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nextExpiryTime === 1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nodeIdToBlacklistExpiryTime.isEmpty)
    +
    +    // Enable external shuffle service to see if all the executors on this node will be killed.
    +    conf.set(config.SHUFFLE_SERVICE_ENABLED, true)
    +    clock.advance(1000)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "2")
    +
    +    verify(allocationClientMock, never).killExecutors(Seq("2"), true, true)
    +    verify(allocationClientMock).killExecutorsOnHost("hostA")
    +
    +    assert(blacklist.nodeIdToBlacklistExpiryTime.contains("hostA"))
    +    assert(blacklist.nodeIdToBlacklistExpiryTime("hostA") ===
    +      2000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nextExpiryTime === 1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +  }
    --- End diff --
    
    If BLACKLIST_KILL_ENABLED=false, the scenario should be the same as [here](https://github.com/apache/spark/pull/17113/files#diff-4f889d7aaf9668baea6e9117b2af946bR551). It looks like a duplication not so necessary?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118260789
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      logWarning(
    +        s"""
    +           |${config.BLACKLIST_FETCH_FAILURE_ENABLED.key} is enabled. If we blacklist
    +           |on fetch failures, we are implicitly saying that we believe the failure is
    +           |non-transient, and can't be recovered from (even if this is the first fetch failure).
    +           |If the external shuffle-service is on, then every other executor on this node would
    +           |be suffering from the same issue, so we should blacklist (and potentially kill) all
    +           |of them immediately.
    +         """.stripMargin)
    +
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +      if (!executorIdToBlacklistStatus.contains(exec)) {
    +        logInfo(s"Blacklisting executor $exec due to fetch failure")
    +
    +        executorIdToBlacklistStatus.put(exec, BlacklistedExecutor(host, expiryTimeForNewBlacklists))
    --- End diff --
    
    I don't think we really need to do anything with blacklisting the individual executors here, just black list the node below, it should take care of doing all the executors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    changes LGTM. 
    
    @squito  did you have any further comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106287979
  
    --- Diff: docs/configuration.md ---
    @@ -1411,6 +1411,15 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +  <td><code>spark.blacklist.application.fetchFailure.enabled</code></td>
    +  <td>false</td>
    +  <td>
    +    (Experimental) If set to "true", Spark will blacklist the executors immediately when the fetch failure
    +    happened. If external shuffle service is enabled, then the whole node will be blacklisted. This configuration
    +    is to handle some scenarios where shuffle fetch is available and cannot be recovered through retry.
    --- End diff --
    
    you may also want to add the same text as configs above about the timeout and dynamic allocation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106245985
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -753,6 +753,12 @@ private[spark] class TaskSetManager(
               tasksSuccessful += 1
             }
             isZombie = true
    +
    +        if (fetchFailed.bmAddress != null) {
    +          blacklistTracker.foreach(_.updateBlacklistForFetchFailure(fetchFailed.bmAddress.host,
    +            fetchFailed.bmAddress.executorId, numTasks - tasksSuccessful))
    --- End diff --
    
    I don't think we should use `numTasks - tasksSuccessful` here.  The intent of `SparkListenerExecutorBlacklisted.taskFailures` field was meant to indicate how many failures occurred on the given executor.  In this case, its always just 1.
    
    Maybe we should really add another field "reason", or something like that, which includes whether it was for task failures or fetch failures.  Though I really don't think that is necessary ... its probably too much detail for the UI, and there is enough info in the logs for someone that needs to dive deeper.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77575/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    > then you get fetch failure again and iterate until job failure
    
    At first I was thinking the node goes bad, but you first detect it via fetch failures -- in that case, you wouldn't need this, when the tasks get re-run on the bad node, they'd fail, and the node would still get blacklisted.  But you're saying the issue is if the node is actually totally healthy, so tasks can still write shuffle data, but the external shuffle service becomes unusable.
    
    I don't think of losing all the shuffle data to be as aggressive as blacklisting the node, but maybe you're right, its pretty similar.
    
    I guess I'm OK with this, though I'm still not sure when I'd tell users to turn this on.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77367/testReport)** for PR 17113 at commit [`44c7108`](https://github.com/apache/spark/commit/44c7108bdf478f823f567d44ed703d445febf6fe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78457/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][SPARK-20898][Core] Improve the blac...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r124165102
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,74 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    --- End diff --
    
    It might be a just self-preferred choice, this code was not written by me, I just made some refactoring.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    sorry i was vague -- I'm saying I'm ok with this as long as its (a) off by default and (b) experimental so we can change it around (which it is).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77582/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    > Another thing I thought about as I was reviewing this -- spark currently assumes that a fetchfailure is always the fault of the source, never the destination. I almost wonder if we should count it against both, with some sensible heuristic for looking at a collection of failures and deciding who is at fault.
    
    I think this makes sense to try to tell the difference and improve our logic there.
    
    > Perhaps we should think of a better way to choose the right behavior.
    
    Does this mean you don't want to go with this approach?  I'm actually not sure this is a huge change.   Its a decent change in behavior but for the cases where nodes really do go down this could help a lot.  I think Spark definitely doesn't handle this case well now.
    
    Sorry again I haven't done a full review been trying to think the entire fetch failure scenarios through and have just been busy with other things.
    
    One downside to adding the config BLACKLIST_FETCH_FAILURE_ENABLED limits us on possibly changing this functionality to say only blacklist on multiple fetch failures.  We could track fetch failures across stage attempts and say only after it fails X number of tasks which could be across stage attempts do we blacklist it.  I guess its marked as experimental so its a bit more ok for us to change it.   Perhaps X isn't just failed tasks but failed tasks from a % of different hosts.  You could potentially use that also to determine if the source is bad or the destination.  If many across hosts have failed then you expect the source is bad, if its just one destination then perhaps that is bad.  Still thinking this through.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @jerryshao  are you actually seeing issues with this on real customer/production jobs?  How often? NM failure for us is very rare.  I'm not familiar with how mesos would fail differently, the shuffle service there is started as a separate service correct? 
    
    We would definitely need to make sure that Spark's retry before really returning fetch failure are good enough to handle the cases with rolling upgrade or intermittent issues with the shuffle, but with our defaults of 3 retries at 5 seconds each not sure that would cover it.  
    
    If people are seeing a need for it and actual issues with it then I'm ok with the idea as long as we make it configurable to turn it off.  I'm not sure on the blacklist after one fetch failure either.  It would seem better to only blacklist after a couple of things have gotten the failure that way you would have a better confidence it was really an issue with the node you are fetching from.
    
    cc @squito 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78457/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #75152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75152/testReport)** for PR 17113 at commit [`f6cc47e`](https://github.com/apache/spark/commit/f6cc47e81ec9075d56af0ee5a31a8b4726a61c43).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    sorry for the delay on this we have been having some discussion about scheduler changes and the fetch failure handling in the scheduler.  Since this is related holding off on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77367/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    sorry haven't had a chance to get to this to do full review, hopefully tomorrow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77582/testReport)** for PR 17113 at commit [`9a14105`](https://github.com/apache/spark/commit/9a14105f88a7a932a4e831ff2377327500250645).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119366199
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -51,29 +51,19 @@ import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
      * acquire a lock on us, so we need to make sure that we don't try to lock the backend while
      * we are holding a lock on ourselves.
      */
    -private[spark] class TaskSchedulerImpl private[scheduler](
    +private[spark] class TaskSchedulerImpl(
         val sc: SparkContext,
         val maxTaskFailures: Int,
    -    private[scheduler] val blacklistTrackerOpt: Option[BlacklistTracker],
         isLocal: Boolean = false)
       extends TaskScheduler with Logging {
     
       import TaskSchedulerImpl._
     
       def this(sc: SparkContext) = {
    -    this(
    -      sc,
    -      sc.conf.get(config.MAX_TASK_FAILURES),
    -      TaskSchedulerImpl.maybeCreateBlacklistTracker(sc))
    +    this(sc, sc.conf.get(config.MAX_TASK_FAILURES))
       }
     
    -  def this(sc: SparkContext, maxTaskFailures: Int, isLocal: Boolean) = {
    -    this(
    -      sc,
    -      maxTaskFailures,
    -      TaskSchedulerImpl.maybeCreateBlacklistTracker(sc),
    -      isLocal = isLocal)
    -  }
    +  private[scheduler] lazy val blacklistTrackerOpt = maybeCreateBlacklistTracker(sc)
    --- End diff --
    
    A comment here as to why we do this would be good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77582/testReport)** for PR 17113 at commit [`9a14105`](https://github.com/apache/spark/commit/9a14105f88a7a932a4e831ff2377327500250645).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @squito  just double checking, are you ok with this change and did you have any comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106248521
  
    --- Diff: docs/configuration.md ---
    @@ -1411,6 +1411,15 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +  <td><code>spark.blacklist.application.fetchFailure.enabled</code></td>
    +  <td>false</td>
    +  <td>
    +    (Experimental) If set to "true", Spark will blacklist the executors immediately when the fetch failure
    +    happened. If external shuffle service is enabled, then the whole node will be blacklisted. This configuration
    +    is to handle some scenarios where shuffle fetch is available and cannot be recovered through retry.
    --- End diff --
    
    the last sentence is pretty confusing ... I am not sure what else to put, maybe just leave it out?  
    
    This change would be helpful when fetch failures come from non-transient issues, but would cause extra recomputations when the failures are transient.  Unfortunately that doesn't really give much guidance to the end user ... I feel like even ourselves we don't really have clear guidelines on when it should be set, its sort of an emergency escape-hatch we're leaving in place.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106249028
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -1039,6 +1039,40 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
           .updateBlacklistForFailedTask(anyString(), anyString(), anyInt())
       }
     
    +  test("update application blacklist for shuffle-fetch") {
    +    // Setup a taskset, and fail some tasks for a fetch failure, preemption, denied commit,
    +    // and killed task.
    +    val conf = new SparkConf()
    +      .set(config.BLACKLIST_ENABLED, true)
    +      .set(config.SHUFFLE_SERVICE_ENABLED, true)
    +      .set(config.BLACKLIST_FETCH_FAILURE_ENABLED, true)
    +    sc = new SparkContext("local", "test", conf)
    +    sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2"))
    +    val taskSet = FakeTask.createTaskSet(4)
    +    val blacklistTracker = new BlacklistTracker(sc, None)
    +    val tsm = new TaskSetManager(sched, taskSet, 4, Some(blacklistTracker))
    +
    +    // make some offers to our taskset, to get tasks we will fail
    +    val taskDescs = Seq(
    +      "exec1" -> "host1",
    +      "exec2" -> "host2"
    +    ).flatMap { case (exec, host) =>
    +      // offer each executor twice (simulating 2 cores per executor)
    +      (0 until 2).flatMap{ _ => tsm.resourceOffer(exec, host, TaskLocality.ANY)}
    +    }
    +    assert(taskDescs.size === 4)
    +
    +    assert(!blacklistTracker.isExecutorBlacklisted(taskDescs(0).executorId))
    +    assert(!blacklistTracker.isNodeBlacklisted("host1"))
    +
    +    // Fail the task with fetch failure
    +    tsm.handleFailedTask(taskDescs(0).taskId, TaskState.FAILED,
    +      FetchFailed(BlockManagerId(taskDescs(0).executorId, "host1", 12345), 0, 0, 0, "ignored"))
    +
    +    assert(blacklistTracker.isExecutorBlacklisted(taskDescs(0).executorId))
    +    assert(blacklistTracker.isNodeBlacklisted("host1"))
    +  }
    --- End diff --
    
    I think you should also have a test with kill enabled, to see that all the executors on the host get killed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75095/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][SPARK-20898][Core] Improve the blac...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r124055023
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,74 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    --- End diff --
    
    Why not make this a private val?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    +1 finally got a clean build, will merge to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73668/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @squito Correct, we really only try to kill running tasks currently on job failure (and if the config setting allows it); but there is the long-standing "TODO: Cancel running tasks in the stage" in `case FetchFailed` of `DAGScheduler#handleTaskCompletion`, which has languished as a TODO both because resolving it would require us both to make Spark itself handle task interruption in the fetch failure case and to deal with the same issues preventing us from making task interruption the default even for job cancelation.
    
    All I'm really saying is that we shouldn't design in a hard requirement that tasks cannot be interrupted (on job cancelation, fetch failure or some other event), because we'd often really like to be able to kill running tasks -- even though we don't know quite how to do that safely right now. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78446/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    I think killing tasks is only applicable in different scenarios, eg. if the [*job* fails](https://github.com/apache/spark/blob/12bf832407eaaed90d7c599522457cb36b303b6c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1513).  Currently, spark does not cancel running tasks when the stage fails due to a fetch failure.  The taskset is marked as a zombie, but running tasks are left alone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77346/testReport)** for PR 17113 at commit [`44c7108`](https://github.com/apache/spark/commit/44c7108bdf478f823f567d44ed703d445febf6fe).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    so unfortunately I haven't actually been seeing this. You can see with external shuffle is something happens to the NM and it does cause job failure.   NM crashes for OOM, something else kills it accidentally, etc.  In those cases the executors are still running fine and map tasks won't fail if its already registered, then you get fetch failure again and iterate until job failure. An alternative to this might be to have the executor re-ping the external shuffle service before starting a new task and have it fail at that point.
    
    We could also add in logic to only blacklist affect certain number of failures, but I don't see this as much different as SPARK-19753 where its going to invalidate immediately.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @mridulm Correct, turning task interruption on by default is not so much a matter of Spark itself handling it well as it is a possible (though not completely known) issue with lower layer libraries not handling interruption well. The original concern with HDFS is likely fixed now, but there are similar concerns with Cassandra and other libraries. Logically, we'd like to interrupt Tasks when associated Jobs or Stages are killed in the DAGScheduler. In practice, nobody knows right now how to do that safely in all circumstances, so the default is to not attempt to interrupt the tasks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119150136
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    --- End diff --
    
    same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78334/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78301/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118751271
  
    --- Diff: docs/configuration.md ---
    @@ -1449,6 +1449,14 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +  <td><code>spark.blacklist.application.fetchFailure.enabled</code></td>
    +  <td>false</td>
    +  <td>
    +    (Experimental) If set to "true", Spark will blacklist the executors immediately when the fetch failure
    --- End diff --
    
    nit: "executor immediately when a fetch failure happens."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77446/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119150113
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    --- End diff --
    
    `${config.BLACKLIST_KILL_ENABLED.key}`
    I realize you are just refactoring this, but might as well fix up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @tgravescs At the config level, it is spark.job.interruptOnCancel or SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, which then gets passed around as a boolean -- e.g. shouldInterruptThread.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77446/testReport)** for PR 17113 at commit [`524fbfc`](https://github.com/apache/spark/commit/524fbfcdd661f5919515b8e18515d346003a3ebe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77346/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    I'm curious did you test the killing part on an actual yarn job?  I was trying it on master and I don't think it works at all due to the way its passing allocation client.  Its a separate issue then this just wondering if you saw same thing


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @tgravescs , the main scenario is external shuffle service unavailable scenario, this could be happened in working preserving + NM failure situation. Also like Mesos + external standalone shuffle service could introduce this issue. In scenarios like rolling upgrade I agreed that NM unavailability is short and this issue could be self-recoverable. One scenario I'm simulating is NM failure. In my test, when NM is failed, RM will detect this failure after 10 minutes by default, before that executors on that NM can still serve the tasks, and Spark doesn't blacklist these containers, so re-issued tasks could still be failed. 
    
    `FetchFailed` will immediately abort the running stage and re-issue parent stage, configurations like failed task number per stage may not be so useful, so my thinking is to backlist these executors/nodes immediately after fetch failure. 
    
    This proposal may have many problems for different scenario, that's why I opened here for comments. If you don't think it is necessary to fix then I could close it.
    
    @markhamstra this patch is targeted to master branch and all the investigations and changes is based on master branch.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78431/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78457/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    can you clarify the situations you are seeing issues? What happened to the NM in this case.  If you have work preserving restart I would think this would actually cause you more problems.  The NM could temporarily be down during rolling upgrade and if you blacklist it, it won't be used for a long time.
    
    We have seen issues with TEZ and MR where blacklisting on fetch failures caused more issues then it solved.  Most of the fetch failures were transient issues and it caused way more things to be rerun then was actually needed.  
    
    This is why we explicitly left it out of the blacklisting feature.  See the design doc here in the jira https://issues.apache.org/jira/browse/SPARK-8425.
    
    I didn't have a chance to do a full review but you seem to be blacklisting an executor, what executor is this blacklisting?  It looks like you are immediately blacklisting on any fetch failure rather then allowing configurable number?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119151885
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      // If we blacklist on fetch failures, we are implicitly saying that we believe the failure is
    +      // non-transient, and can't be recovered from (even if this is the first fetch failure).
    --- End diff --
    
    I would expand this comment to explain why we don't wait for multiple failures -- stage is retried after just one failure, so we don't always get a chance to collect multiple fetch failures.
    
    But to be honest, I still really wonder about that part.  So what if takes a couple stages to get enough fetch failures to blacklist the node?  This seems so drastic that I'm having a tough time imagining a scenario where I'd actually recommend this setting to a user.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118751130
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      // spark.blacklist.application.fetchFailure.enabled is enabled. If we blacklist
    +      // on fetch failures, we are implicitly saying that we believe the failure is
    +      // non-transient, and can't be recovered from (even if this is the first fetch failure).
    +      // If the external shuffle-service is on, then every other executor on this node would
    +      // be suffering from the same issue, so we should blacklist (and potentially kill) all
    +      // of them immediately.
    +
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +
    +      if (conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
    +        if (!nodeIdToBlacklistExpiryTime.contains(host)) {
    +          logInfo(s"blacklisting node $host due to fetch failure of external shuffle service")
    +
    +          nodeIdToBlacklistExpiryTime.put(host, expiryTimeForNewBlacklists)
    +          listenerBus.post(SparkListenerNodeBlacklisted(now, host, 1))
    +          _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)
    +          killExecutorsOnBlacklistedNode(host)
    --- End diff --
    
    it would also be nice to make sure we have a test for the expiry.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118748969
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      // spark.blacklist.application.fetchFailure.enabled is enabled. If we blacklist
    +      // on fetch failures, we are implicitly saying that we believe the failure is
    +      // non-transient, and can't be recovered from (even if this is the first fetch failure).
    +      // If the external shuffle-service is on, then every other executor on this node would
    +      // be suffering from the same issue, so we should blacklist (and potentially kill) all
    +      // of them immediately.
    +
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +
    +      if (conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
    +        if (!nodeIdToBlacklistExpiryTime.contains(host)) {
    +          logInfo(s"blacklisting node $host due to fetch failure of external shuffle service")
    +
    +          nodeIdToBlacklistExpiryTime.put(host, expiryTimeForNewBlacklists)
    +          listenerBus.post(SparkListenerNodeBlacklisted(now, host, 1))
    +          _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)
    +          killExecutorsOnBlacklistedNode(host)
    --- End diff --
    
    I think we need to call updateNextExpiryTime() here now  since there isn't an explicit executorid being blacklisted which would have normally called it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77346/testReport)** for PR 17113 at commit [`44c7108`](https://github.com/apache/spark/commit/44c7108bdf478f823f567d44ed703d445febf6fe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78334 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78334/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78334 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78334/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77446/testReport)** for PR 17113 at commit [`524fbfc`](https://github.com/apache/spark/commit/524fbfcdd661f5919515b8e18515d346003a3ebe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    I'm fine with this, just a couple asks:
    
    1) is there a test for SPARK-20898?  I know at some point in the development of killing blacklisted executors we tested on yarn, I'm really disappointed (in myself) for merging it though it didn't actually work.  It would be nice to have a regression test.  Maybe that would be a huge pain to do, just wanted to see if you thought about it.
    
    2) Can you update the jira description for SPARK-13669 now?  I think a lot has changed since that was initially opened.  Also IIUC, we don't have any particular advice on when users should enable this at the moment, its just an escape hatch you'd like to leave in place?  If there is any advice, putting it in the jira would also make sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77344/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][SPARK-20898][Core] Improve the blac...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r124060233
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala ---
    @@ -529,4 +529,59 @@ class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with M
         verify(allocationClientMock).killExecutors(Seq("2"), true, true)
         verify(allocationClientMock).killExecutorsOnHost("hostA")
       }
    +
    +  test("fetch failure blacklisting kills executors, configured by BLACKLIST_KILL_ENABLED") {
    +    val allocationClientMock = mock[ExecutorAllocationClient]
    +    when(allocationClientMock.killExecutors(any(), any(), any())).thenReturn(Seq("called"))
    +    when(allocationClientMock.killExecutorsOnHost("hostA")).thenAnswer(new Answer[Boolean] {
    +      // To avoid a race between blacklisting and killing, it is important that the nodeBlacklist
    +      // is updated before we ask the executor allocation client to kill all the executors
    +      // on a particular host.
    +      override def answer(invocation: InvocationOnMock): Boolean = {
    +        if (blacklist.nodeBlacklist.contains("hostA") == false) {
    +          throw new IllegalStateException("hostA should be on the blacklist")
    +        }
    +        true
    +      }
    +    })
    +
    +    conf.set(config.BLACKLIST_FETCH_FAILURE_ENABLED, true)
    +    blacklist = new BlacklistTracker(listenerBusMock, conf, Some(allocationClientMock), clock)
    +
    +    // Disable auto-kill. Blacklist an executor and make sure killExecutors is not called.
    +    conf.set(config.BLACKLIST_KILL_ENABLED, false)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "1")
    +
    +    verify(allocationClientMock, never).killExecutors(any(), any(), any())
    +    verify(allocationClientMock, never).killExecutorsOnHost(any())
    +
    +    // Enable auto-kill. Blacklist an executor and make sure killExecutors is called.
    +    conf.set(config.BLACKLIST_KILL_ENABLED, true)
    +    blacklist = new BlacklistTracker(listenerBusMock, conf, Some(allocationClientMock), clock)
    +    clock.advance(1000)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "1")
    +
    +    verify(allocationClientMock).killExecutors(Seq("1"), true, true)
    +    verify(allocationClientMock, never).killExecutorsOnHost(any())
    +
    +    assert(blacklist.executorIdToBlacklistStatus.contains("1"))
    +    assert(blacklist.executorIdToBlacklistStatus("1").node === "hostA")
    +    assert(blacklist.executorIdToBlacklistStatus("1").expiryTime ===
    +      1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nextExpiryTime === 1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nodeIdToBlacklistExpiryTime.isEmpty)
    +
    +    // Enable external shuffle service to see if all the executors on this node will be killed.
    +    conf.set(config.SHUFFLE_SERVICE_ENABLED, true)
    +    clock.advance(1000)
    +    blacklist.updateBlacklistForFetchFailure("hostA", exec = "2")
    +
    +    verify(allocationClientMock, never).killExecutors(Seq("2"), true, true)
    +    verify(allocationClientMock).killExecutorsOnHost("hostA")
    +
    +    assert(blacklist.nodeIdToBlacklistExpiryTime.contains("hostA"))
    +    assert(blacklist.nodeIdToBlacklistExpiryTime("hostA") ===
    +      2000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +    assert(blacklist.nextExpiryTime === 1000 + blacklist.BLACKLIST_TIMEOUT_MILLIS)
    +  }
    --- End diff --
    
    Should we also test with SHUFFLE_SERVICE_ENABLED=true and BLACKLIST_KILL_ENABLED=false ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #75095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75095/testReport)** for PR 17113 at commit [`f6cc47e`](https://github.com/apache/spark/commit/f6cc47e81ec9075d56af0ee5a31a8b4726a61c43).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118315661
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      logWarning(
    +        s"""
    +           |${config.BLACKLIST_FETCH_FAILURE_ENABLED.key} is enabled. If we blacklist
    +           |on fetch failures, we are implicitly saying that we believe the failure is
    +           |non-transient, and can't be recovered from (even if this is the first fetch failure).
    +           |If the external shuffle-service is on, then every other executor on this node would
    +           |be suffering from the same issue, so we should blacklist (and potentially kill) all
    +           |of them immediately.
    +         """.stripMargin)
    +
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +      if (!executorIdToBlacklistStatus.contains(exec)) {
    +        logInfo(s"Blacklisting executor $exec due to fetch failure")
    +
    +        executorIdToBlacklistStatus.put(exec, BlacklistedExecutor(host, expiryTimeForNewBlacklists))
    --- End diff --
    
    we don't really need to blacklist the executor if the shuffle service is enabled and we are going to do the entire host anyway.  perhaps put this as else to the if external shuffle service is enabled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78301/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @markhamstra Completely agree, I would love to see this enabled by default. For example, I really hate to see speculative tasks continuing to run when the taskset has completed (for example) - used to make it necessary to overprovision to ensure subsequent taskset has all tasks running from get go.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77622/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #73668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73668/testReport)** for PR 17113 at commit [`a90b23d`](https://github.com/apache/spark/commit/a90b23d894bfd568a054a5752c958c36aa58c79b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77367/testReport)** for PR 17113 at commit [`44c7108`](https://github.com/apache/spark/commit/44c7108bdf478f823f567d44ed703d445febf6fe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    > Spark does immediately abort the stage but it doesn't kill the running tasks
    
    Whether running tasks are interrupted on stage abort or not depends on the state of a config boolean -- and ideally we'd like to get to the point where we can confidently set that config so that running tasks are interrupted when the associated job or stage dies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77622/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @markhamstra given the impact interruption has on lower layer libraries which dont handle it well (iirc hdfs ?), we probably will not set it to true even if spark code is robust,


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78301/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @jerryshao  sorry my delay on this, we have rough design what we want to do for future changes but I think those are going to take a while and in the mean time I think this is a useful addition as we have been seeing issues this would have helped.   
    
    Could you bring this up to date and address the comments?  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118260480
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      logWarning(
    +        s"""
    +           |${config.BLACKLIST_FETCH_FAILURE_ENABLED.key} is enabled. If we blacklist
    --- End diff --
    
    I don't think we need this long warning I would just remove.  perhaps leave it as a comment in the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
     > Whether running tasks are interrupted on stage abort or not depends on the state of a config boolean -- and ideally we'd like to get to the point where we can confidently set that config so 
    > that running tasks are interrupted when the associated job or stage dies.
    
    @markhamstra which config are you referring to?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #74234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74234/testReport)** for PR 17113 at commit [`7ba0623`](https://github.com/apache/spark/commit/7ba06231871fad25670cbc233283125c812e4f36).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    I was not talking about designing this around the killing task part of this, other then in reference to being able to count the # of fetch failures before triggering the blacklisting, but I think that would have to be handled across stages right now.  My main concern with this pr is invalidating the shuffle outputs.  I do also believe that fetch failures could be handled better but that is more being discussed in https://github.com/apache/spark/pull/17088. 
    
    I do also agree with you that the task killing part should be fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][SPARK-20898][Core] Improve the blac...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17113


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77344/testReport)** for PR 17113 at commit [`44c7108`](https://github.com/apache/spark/commit/44c7108bdf478f823f567d44ed703d445febf6fe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118261422
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      logWarning(
    +        s"""
    +           |${config.BLACKLIST_FETCH_FAILURE_ENABLED.key} is enabled. If we blacklist
    +           |on fetch failures, we are implicitly saying that we believe the failure is
    +           |non-transient, and can't be recovered from (even if this is the first fetch failure).
    +           |If the external shuffle-service is on, then every other executor on this node would
    +           |be suffering from the same issue, so we should blacklist (and potentially kill) all
    +           |of them immediately.
    +         """.stripMargin)
    +
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +      if (!executorIdToBlacklistStatus.contains(exec)) {
    +        logInfo(s"Blacklisting executor $exec due to fetch failure")
    +
    +        executorIdToBlacklistStatus.put(exec, BlacklistedExecutor(host, expiryTimeForNewBlacklists))
    +        // We hardcoded number of failure tasks to 1 for fetch failure, because there's no
    +        // reattempt for such failure.
    +        listenerBus.post(SparkListenerExecutorBlacklisted(now, exec, 1))
    +        updateNextExpiryTime()
    +        killBlacklistedExecutor(exec)
    +
    +        val blacklistedExecsOnNode = nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
    +        blacklistedExecsOnNode += exec
    +
    +        if (conf.getBoolean("spark.shuffle.service.enabled", false) &&
    +            !nodeIdToBlacklistExpiryTime.contains(host)) {
    +          logInfo(s"blacklisting node $host due to fetch failure of external shuffle service")
    +
    +          nodeIdToBlacklistExpiryTime.put(host, expiryTimeForNewBlacklists)
    +          listenerBus.post(SparkListenerNodeBlacklisted(now, exec, blacklistedExecsOnNode.size))
    --- End diff --
    
    the parameter to SparkListenerNodeBlacklisted  needs to be the node, not the exec.
    
    we should test on the UI to make sure this shows up correctly, I think there might be a race condition between this and when the fetchfailure code in dag scheduler would remove the executor so it doesn't always show up on UI


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106246521
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,63 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String, numFailedTasks: Int): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +      if (!executorIdToBlacklistStatus.contains(exec)) {
    +        logInfo(s"Blacklisting executor $exec due to fetch failure")
    +
    +        executorIdToBlacklistStatus.put(exec, BlacklistedExecutor(host, expiryTimeForNewBlacklists))
    +        listenerBus.post(SparkListenerExecutorBlacklisted(now, exec, numFailedTasks))
    --- End diff --
    
    given my other comment at the callsite, I'd get rid of `numFailedTasks` as an parameter and just hardcode this to `1`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Thanks @tgravescs , I will update the code soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74234/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r118747665
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      // spark.blacklist.application.fetchFailure.enabled is enabled. If we blacklist
    --- End diff --
    
    I think we can remove "spark.blacklist.application.fetchFailure.enabled is enabled"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    "Current Spark's blacklist mechanism": please be more precise. The most recent released version of Spark, 2.1.0, does not include a lot of recent changes to blacklisting (mostly https://github.com/apache/spark/commit/93cdb8a7d0f124b4db069fd8242207c82e263c52). Are the problems you are describing fully explored with the master branch of Spark?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106248846
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -1039,6 +1039,40 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
           .updateBlacklistForFailedTask(anyString(), anyString(), anyInt())
       }
     
    +  test("update application blacklist for shuffle-fetch") {
    +    // Setup a taskset, and fail some tasks for a fetch failure, preemption, denied commit,
    +    // and killed task.
    --- End diff --
    
    comment is wrong


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #73668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73668/testReport)** for PR 17113 at commit [`a90b23d`](https://github.com/apache/spark/commit/a90b23d894bfd568a054a5752c958c36aa58c79b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    please add jira SPARK-20898 to the description since fixing that here 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #75095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75095/testReport)** for PR 17113 at commit [`f6cc47e`](https://github.com/apache/spark/commit/f6cc47e81ec9075d56af0ee5a31a8b4726a61c43).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, test this please



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r106246809
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,63 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id $exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (a.killExecutorsOnHost(node) == false) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String, numFailedTasks: Int): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      val now = clock.getTimeMillis()
    +      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
    +      if (!executorIdToBlacklistStatus.contains(exec)) {
    +        logInfo(s"Blacklisting executor $exec due to fetch failure")
    +
    +        executorIdToBlacklistStatus.put(exec, BlacklistedExecutor(host, expiryTimeForNewBlacklists))
    +        listenerBus.post(SparkListenerExecutorBlacklisted(now, exec, numFailedTasks))
    +        updateNextExpiryTime()
    +        killBlacklistedExecutor(exec)
    +
    +        val blacklistedExecsOnNode = nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
    +        blacklistedExecsOnNode += exec
    +
    +        if (SparkEnv.get.blockManager.externalShuffleServiceEnabled &&
    +          !nodeIdToBlacklistExpiryTime.contains(host)) {
    --- End diff --
    
    nit: double-indent second line of `if`.
    
    also -- this looks extremely drastic, so I think its worth including a comment here about why we do this, something like:  If we blacklist on fetch failures, we are implicitly saying that we believe the failure is non-transient, and can't be recovered from (even if this is the first fetch failure).  If the external shuffle-service is on, then every other executor on this node would be suffering from the same issue, so we should blacklist (and potentially kill) all of them immediately.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @tgravescs , I just added a configuration to turn off this feature by default.
    
    Do you have any further comments on it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78431/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Hi @squito , 
    
    For the 1st point, I tested manually in real cluster, but I'm not sure how to make it happen in UT, if you think it is necessary to add a UT about this issue, then I will think about how to address it.
    
    For the 2nd point, I will update the JIRA description, thanks for the remainder.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Emmm... the failure is irrelevant, retest this please. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #78446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78446/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    So I looked at this a little more.  I'm more ok with this since Spark doesn't actually invalidate the shuffle output. You are basically just trying to stop new tasks from running on the executors already on that host. Its either going to just blacklist those or kill them if you have that feature on.
    
    Part of the reason we left it off to begin with was again we didn't want to blacklist on the transient ones so we wanted to wait to see if it was truly an issue in real life.  if you do put this in I would like it configurable off until we have more data as to if its really a problem users see.
    
    Spark does immediately abort the stage but it doesn't kill the running tasks, so if other tasks fetch failure before it can rerun the map task the scheduler knows about them, but that is very timing dependent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Thanks @tgravescs , no problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @tgravescs , thanks a lot for your comments.
    
    Actually the issue here is a simulated one from my test cluster, I didn't get an issue report from real customers. 
    
    Yes, in most of the cases Shuffle fetch failure is transient and could be recovered through retry, but for some small jobs this gap between failure and recovery is enough for the job to be failed with retry.
    
    The difficult thing here is that for fetch failure, Spark immediately abort the stage without any retry of the tasks, so unlike normal failures of tasks, we may have no chance to monitor several fetch failures until we decide to blacklist it. That's why in this PR I immediately blacklist the executors/nodes in the application level after fetch failure. The solution is too strict that it will also blacklist some transient failure executors, that's what I mainly concerned about.
    
    So here looking for any comments, greatly appreciated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119260232
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -54,7 +54,7 @@ import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
     private[spark] class TaskSchedulerImpl private[scheduler](
         val sc: SparkContext,
         val maxTaskFailures: Int,
    -    private[scheduler] val blacklistTrackerOpt: Option[BlacklistTracker],
    +    mockBlacklistTracker: Option[BlacklistTracker] = None,
    --- End diff --
    
    @squito the previous code here in `maybeCreateBlacklistTracker` has initialing ordering issue, this method is called before `schedulerBackend` is created, so the `ExecutorAllocationClient` we got will always be None. So I'm trying to address this issue. Let me find out a better way to handle it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78446/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #77622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77622/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17113: [SPARK-13669][Core] Improve the blacklist mechanism to h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    **[Test build #74234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74234/testReport)** for PR 17113 at commit [`7ba0623`](https://github.com/apache/spark/commit/7ba06231871fad25670cbc233283125c812e4f36).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17113: [SPARK-13669][Core] Improve the blacklist mechani...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r107604884
  
    --- Diff: docs/configuration.md ---
    @@ -1411,6 +1411,15 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +  <td><code>spark.blacklist.application.fetchFailure.enabled</code></td>
    +  <td>false</td>
    +  <td>
    +    (Experimental) If set to "true", Spark will blacklist the executors immediately when the fetch failure
    +    happened. If external shuffle service is enabled, then the whole node will be blacklisted. This configuration
    +    is to handle some scenarios where shuffle fetch is available and cannot be recovered through retry.
    --- End diff --
    
    >you may also want to add the same text as configs above about the timeout and dynamic allocation
    
    What do you specifically point to? Sorry I cannot get it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org