You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by IgorBerman <gi...@git.apache.org> on 2018/02/20 12:32:04 UTC

[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

GitHub user IgorBerman opened a pull request:

    https://github.com/apache/spark/pull/20640

    [SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend

    ## What changes were proposed in this pull request?
    This is another variant of https://github.com/apache/spark/pull/17619
    I propose less intrusive change, which won't remove blacklisting of mesos slaves completely, but rather will give ability to control constant of max failures and to enable/disable this behavior, thus giving ability to decide in each usecase. 
    Proposed defaults will setup behavior as before to be "backward compatible"
    
    ## How was this patch tested?
    Created unit test for blacklisting of slaves
    
    @timout 
    @squito (as reviewer of previous PR)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/IgorBerman/spark SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20640
    
----
commit ca4eba2ac59eadd1c1f3c36a5960174bbc9e662f
Author: IgorBerman <ig...@...>
Date:   2018-02-20T09:18:49Z

    [SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r189441161
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
           cpus + totalCoresAcquired <= maxCores &&
           mem <= offerMem &&
           numExecutors < executorLimit &&
    -      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
    +      !scheduler.nodeBlacklist().contains(offerHostname) &&
    --- End diff --
    
    Squito sounds reasonable. In the mean time we have to deal with a limitation at the mesos side where the value is hardcoded. So we can move with this incrementally.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r191272509
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,14 +645,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
             if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    -                "is Spark installed on it?")
    -          }
    +          logError(s"Task $taskId failed on Mesos slave $slaveId.")
    --- End diff --
    
    +1 ok


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #91911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91911/testReport)** for PR 20640 at commit [`2c47271`](https://github.com/apache/spark/commit/2c47271176b82e4859667ede9bb02b28b8fba50e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87558/testReport)** for PR 20640 at commit [`2240259`](https://github.com/apache/spark/commit/22402594645b3ee14106da61cb401d6555ba2e1b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87556/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88628/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169497847
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,15 +645,6 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
    -        if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    --- End diff --
    
    Is it a concern to lose this error message? (I don't know anything about Mesos but it does seem potentially useful?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @felixcheung  @squito let's merge this to avoid the limitation of MAX_SLAVE_FAILURES = 2 (this is not good at all) and then for sake of completeness and correctness we adapt the fix in SPARK-16630 to mesos.
    I think blacklisting code for yarn is more sophisticated anyway. Thoughts?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    thats fine with me, but as I'm neither a user of mesos nor am I in touch w/ many mesos users, I'd like wait a bit for more opinions, given the ramifications of this change. (that shouldn't block work on the better version, if anybody wants to take that on ...) @IgorBerman @susanxhuynh
    
    blacklisting in yarn (at least, what spark does internally already) is really not much more sophisticated, at least before SPARK-16630; spark does tell yarn which hosts it has blacklisted so it wants to avoid for future executors, but thats about it.  Yarn itself is doing a little more as well, as it has its own disk health checker etc., and it'll try to exclude resources from *all* applications if it thinks they are bad.  but that is independent of changes within spark itself.
    
    also I'd like to see a jira for mesos to discuss the other improvements we've discussed here to be more like SPARK-16630 so we don't forget about it.  @skonto can you file that jira and try to capture some of the points that have been raised in the discussion here? (or maybe that jira exists already?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169554433
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
           cpus + totalCoresAcquired <= maxCores &&
           mem <= offerMem &&
           numExecutors < executorLimit &&
    -      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
    +      !scheduler.nodeBlacklist().contains(slaveId) &&
    --- End diff --
    
    You are right, thank! 
    
    here:
    https://github.com/apache/spark/blob/9e50a1d37a4cf0c34e20a7c1a910ceaff41535a2/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L193
    
    I've managed to track it to https://github.com/apache/spark/blob/e18d6f5326e0d9ea03d31de5ce04cb84d3b8ab37/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L852
    
    I'll change it to offerHostname


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87563/testReport)** for PR 20640 at commit [`1eb701c`](https://github.com/apache/spark/commit/1eb701cdd07652d4f196bc960b7d8218c0c46ba0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @skonto, @susanxhuynh, @squito So let's agree that:
    1. I'll revert log when there is some failure. I'll reword it to be something without "blacklisting"
    2. The blacklisting itself will be moved to BlacklistTracker(as it now)
    
    bottom line the only thing missing - adding log in a case of failure(but without counting number of failures etc)
    
    WDYT?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #91911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91911/testReport)** for PR 20640 at commit [`2c47271`](https://github.com/apache/spark/commit/2c47271176b82e4859667ede9bb02b28b8fba50e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    fyi I just merged SPARK-16330, which is currently very yarn specific, but I think you could easily refactor that code to reuse a lot for mesos too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r179013270
  
    --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala ---
    @@ -108,6 +108,28 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
         verifyTaskLaunched(driver, "o2")
       }
     
    +  test("mesos declines offers from blacklisted slave") {
    +    setBackend()
    +
    +    // launches a task on a valid offer on slave s1
    +    val minMem = backend.executorMemory(sc) + 1024
    +    val minCpu = 4
    +    val offer1 = Resources(minMem, minCpu)
    +    offerResources(List(offer1))
    +    verifyTaskLaunched(driver, "o1")
    +
    +    // for any reason executor(aka mesos task) failed on s1
    +    val status = createTaskStatus("0", "s1", TaskState.TASK_FAILED)
    +    backend.statusUpdate(driver, status)
    +    when(taskScheduler.nodeBlacklist()).thenReturn(Set("hosts1"))
    --- End diff --
    
    just to re-iterate my point above -- in many cases, having an executor fail will *not* lead to `taskScheduler.nodeBlacklist()` changing as you're doing here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    > @squito reading the code here: https://github.com/apache/spark/pull/21068/files
    is there an option to update the info about blacklisted node when there is a mesos task failure.
    It is a bit inconvenient to lose such events and wait for spark tasks to fail which may never launch since you dont have any executors running anyway.
    
    That change does not cover mesos, to keep the scope a little smaller and because none of us really know how to test on mesos.  But it should be pretty straightforward to refactor whats there so you could use the same general logic in mesos.  I think it would be easy for you to unify a lot of that, even the failure validity interval stuff.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91930/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91639/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #88486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88486/testReport)** for PR 20640 at commit [`66ed5af`](https://github.com/apache/spark/commit/66ed5afae1a5b4856e84c805c7858d552f38b26a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by susanxhuynh <gi...@git.apache.org>.
Github user susanxhuynh commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @skonto Yes, I'm ok with that. Sorry for the delayed response.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169391079
  
    --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala ---
    @@ -108,6 +108,28 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
         verifyTaskLaunched(driver, "o2")
       }
     
    +  test("mesos declines offers from blacklisted slave") {
    +    setBackend()
    +
    +    // launches a task on a valid offer on slave s0
    +    val minMem = backend.executorMemory(sc) + 1024
    +    val minCpu = 4
    +    val offer1 = Resources(minMem, minCpu)
    +    offerResources(List(offer1))
    +    verifyTaskLaunched(driver, "o1")
    +
    +    // for any reason executor(aka mesos task) failed on s0
    +    val status = createTaskStatus("0", "s1", TaskState.TASK_FAILED)
    +    backend.statusUpdate(driver, status)
    +    when(taskScheduler.nodeBlacklist()).thenReturn(Set("s1"))
    +
    +    val offer2 = Resources(minMem, minCpu)
    +    // Launches a new task on a valid offer from the same slave
    +    offerResources(List(offer2))
    +    // but since it's blacklisted the offer is declined
    +    verifyDeclinedOffer(driver, createOfferId("o1"))
    --- End diff --
    
    will this actually pass?  I thought it wouldn't b/c the filtering is done inside `buildMesosTasks`, which never calls `declineOffer` on offers that fail `canLaunchTask`.  (a separate thing which needs fixing -- you could open another issue for that.)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r196997885
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,14 +645,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
             if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    -                "is Spark installed on it?")
    -          }
    +          logError(s"Task $taskId failed on Mesos slave $slaveId.")
    --- End diff --
    
    @IgorBerman I'm not entirely sure what you mean.
    
    yes, *eventually* I think mesos should be doing something very simliar to whats in that PR.  You can't use that immediately, because for now the other PR is tied to yarn internals.  But I don't think it would be too hard to refactor what's there just a little bit so most of the logic could be reused.
    
    but I think everybody just wants to get this change in, and do that in a followup.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #91930 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91930/testReport)** for PR 20640 at commit [`5eda874`](https://github.com/apache/spark/commit/5eda874e1b9b05396c57413b743995201e02ec3d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito before you merge anything give me some time to have a look. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @susanxhuynh good point about changing default behavior.  I'd rather have the change so we have more unified behavior between mesos and other cluster managers.  But I have never run spark on mesos or even had much interaction with users of spark on mesos, so I will defer to others judgement.  Another option: we could leave the old behavior, unless a user sets spark.blacklist.enabled=true.  its a little wonky, but that also guarantees you always get some blacklisting.
    
    I've also been considering turning blacklisting on by default in spark 2.4.  So far I've had good feedback from users running it (though we'll get way more feedback when its on by default).
    
    btw, one hole in the general blacklisting is handling cases when the executors fail to even start: https://issues.apache.org/jira/browse/SPARK-16630
    
    would that be covered by mesos with the old code?  just want to make sure we aren't losing that ability.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87556 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87556/testReport)** for PR 20640 at commit [`ca4eba2`](https://github.com/apache/spark/commit/ca4eba2ac59eadd1c1f3c36a5960174bbc9e662f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @IgorBerman any thought on this comment? https://github.com/apache/spark/pull/20640#discussion_r191272487


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #91930 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91930/testReport)** for PR 20640 at commit [`5eda874`](https://github.com/apache/spark/commit/5eda874e1b9b05396c57413b743995201e02ec3d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87562/testReport)** for PR 20640 at commit [`918e066`](https://github.com/apache/spark/commit/918e066cbb661110b6779c558a5377147a5b1d1e).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    cc @attilapiros , you may be interested b/c of how this relates to SPARK-16630


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87564/testReport)** for PR 20640 at commit [`2f1db89`](https://github.com/apache/spark/commit/2f1db89de162a05a1cee4c1221de52eacbcbdb68).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169556816
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,15 +645,6 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
    -        if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    --- End diff --
    
    @kayousterhout BlacklistTracker has it's own logging that is concerned with blacklisted nodes, won't it be enough? on the other hand, if blacklisting is disabled, which is default, then we will lose this information.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r195829448
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,14 +645,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
             if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    -                "is Spark installed on it?")
    -          }
    +          logError(s"Task $taskId failed on Mesos slave $slaveId.")
    --- End diff --
    
    @squito @felixcheung wdyt regarding adding almost same lines here as in https://github.com/apache/spark/pull/21068/files#diff-65ed0dbf413c9f48cfa8f6eed9f3f0d5R73


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @felixcheung sorry I missed something? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88486/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87564/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @susanxhuynh are you ok with that?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    I understand if you want to do something like this for yourself to unblock, but I think I'm -1 on merging this because of adding more configs just for a stopgap.
    
    but I think we agree on the right solution here -- if you post that I will try to review promptly since I've got this paged in (though it might take a bit to merge as we figure out how to test ...)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @felixcheung sure, no problem. I'll open jira, and will add comment


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    ping @IgorBerman 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by hantuzun <gi...@git.apache.org>.
Github user hantuzun commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Thanks a lot for the work here.
    
    Wouldn't it be better if we change the log in the last commit (https://github.com/apache/spark/pull/20640/commits/66ed5afae1a5b4856e84c805c7858d552f38b26a) like below? 
    ```diff
    - Mesos slave $slaveId failed
    + Task $taskId failed on Mesos slave $slaveId.
    ```
    
    I'm not sure about logging convention of the project but I feel like logging it as `error` could be an option.
    
    ---
    
    OOT: Having Spark 2.3 released, how could we have https://spark.apache.org/versioning-policy.html updated with Spark 2.4 release window?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito I'm with debug logging and see declines. Overall it's working. However there are different possible reasons for declines not necessary due to blacklisted node(Actually there is jira that talks about enriching this information about declines)
    
    Regarding "much effect" - I mean that killing executor manually is not necessary causing BlacklistTracker to add it to the blacklist(driver usually accepts offer from other slave), so chances that I'll cause enough failures on same executor are pretty low. Maybe I'll reduce all blacklist.max* to be 1, thus aggressively blacklisting only 1 failure. WDYT?
    
    Regarding "other bug", which is probably not a bug...My understanding was that in coarse grained mode only 1 executor possible per application per node, but might be I'm wrong.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito @IgorBerman let's move on with this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87563/testReport)** for PR 20640 at commit [`1eb701c`](https://github.com/apache/spark/commit/1eb701cdd07652d4f196bc960b7d8218c0c46ba0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @susanxhuynh I agree in case it is not enabled we can log failures as usual, but not for blacklisting as it is disabled it wouldnt make sense. User should have this option not to care. 
    
    > In the case where an executor fails before entering Spark code (for example, Mesos agent failed to create the sandbox), would it be detected?
    
    Good question forgot to mention this. In this scenario a task status update will be given
    eg. [REASON_CONTAINER_LAUNCH_FAILED](https://github.com/apache/mesos/blob/5e5a8102c3281db25a37157dac123b0ca546e030/docs/task-state-reasons.md)
    
    This is done implicitly [here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L658)  in status update which then calls [removeExecutor ](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L728) which then sends a message to drivers point to remove the executor and then this [line](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L330) is called which then will calls another helper [method](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L536), which calls this [one](ht
 tps://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L595) and in there blacklist info is updated.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito updated


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by susanxhuynh <gi...@git.apache.org>.
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169676540
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,15 +645,6 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
    -        if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    --- End diff --
    
    @IgorBerman Yes, in the default case, it would be nice to have this information about a task failing, especially if it fails repeatedly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87556 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87556/testReport)** for PR 20640 at commit [`ca4eba2`](https://github.com/apache/spark/commit/ca4eba2ac59eadd1c1f3c36a5960174bbc9e662f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @IgorBerman I saw the ticket. We need to cover mesos task failures because they are also pretty common.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169500415
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
           cpus + totalCoresAcquired <= maxCores &&
           mem <= offerMem &&
           numExecutors < executorLimit &&
    -      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
    +      !scheduler.nodeBlacklist().contains(slaveId) &&
    --- End diff --
    
    In other places it looks like the hostname is used in the blacklist - why does this check against the slaveId instead of the offerHostname?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r169413261
  
    --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala ---
    @@ -108,6 +108,28 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
         verifyTaskLaunched(driver, "o2")
       }
     
    +  test("mesos declines offers from blacklisted slave") {
    +    setBackend()
    +
    +    // launches a task on a valid offer on slave s0
    +    val minMem = backend.executorMemory(sc) + 1024
    +    val minCpu = 4
    +    val offer1 = Resources(minMem, minCpu)
    +    offerResources(List(offer1))
    +    verifyTaskLaunched(driver, "o1")
    +
    +    // for any reason executor(aka mesos task) failed on s0
    +    val status = createTaskStatus("0", "s1", TaskState.TASK_FAILED)
    +    backend.statusUpdate(driver, status)
    +    when(taskScheduler.nodeBlacklist()).thenReturn(Set("s1"))
    +
    +    val offer2 = Resources(minMem, minCpu)
    +    // Launches a new task on a valid offer from the same slave
    +    offerResources(List(offer2))
    +    // but since it's blacklisted the offer is declined
    +    verifyDeclinedOffer(driver, createOfferId("o1"))
    --- End diff --
    
    ah nevermind.  I took another look at the code and now I see how this works


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r179012891
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
           cpus + totalCoresAcquired <= maxCores &&
           mem <= offerMem &&
           numExecutors < executorLimit &&
    -      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
    +      !scheduler.nodeBlacklist().contains(offerHostname) &&
    --- End diff --
    
    I just want to make really sure everybody understands the big change in behavior here -- `nodeBlacklist()` currently *only* gets updated based on failures in *spark* tasks.  If a mesos task fails to even start -- that is, if a spark executor fails to launch on a node -- `nodeBlacklist` does not get updated.  So you could have a node that is misconfigured somehow, and you might end up repeatedly trying to launch executors on it after this changed, with the executor even failing to start each time.  That is even if you have blacklisting on.
    
    This is SPARK-16630 for the non-mesos case.  That is being actively worked on now -- however the work there will probably have to be yarn-specific, so there will still be followup work to get the same thing for mesos after that is in.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87563/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by susanxhuynh <gi...@git.apache.org>.
Github user susanxhuynh commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @skonto We should not remove the logging. The logging [here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L194) is only available if blacklisting is enabled, but by default blacklisting is disabled. The BlacklistTracker object [is not created](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L748) if blacklisting is disabled. But, we might still want to see the log of executor failure in this case.
    
    In the case where an executor fails before entering Spark code (for example, Mesos agent failed to create the sandbox), would it be detected?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #88486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88486/testReport)** for PR 20640 at commit [`66ed5af`](https://github.com/apache/spark/commit/66ed5afae1a5b4856e84c805c7858d552f38b26a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r191272487
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
           cpus + totalCoresAcquired <= maxCores &&
           mem <= offerMem &&
           numExecutors < executorLimit &&
    -      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
    +      !scheduler.nodeBlacklist().contains(offerHostname) &&
    --- End diff --
    
    maybe comment on this in the code here and add a JIRA for tracking?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    sure @squito .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87580/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    sure @skonto, great to have somebody more knowledgable on mesos taking a closer look at this.
    
    sorry @IgorBerman I promised a quick fix here, but have realized this is more complicated than we originally thought.  but the discussion is moving forward at least, so thanks for driving it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @skonto, @squito , @kayousterhout ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #88628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88628/testReport)** for PR 20640 at commit [`a7ff8cc`](https://github.com/apache/spark/commit/a7ff8cccd1b7e5564880c40c503c169c6bed46b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @IgorBerman I actually think that https://github.com/apache/spark/pull/17619 is the right approach.  As @timout pointed out on that one, this functionality doesn't need to be covered in mesos specific code at all, as its covered by the BlacklistTracker.  I don't like introducing new configs when we don't really need them.  Other than this being less invasive, is there another advantage here?
    
    The modifications I suggested to that PR are relatively small -- I think its fine if you want to open a PR that is the original updated with my suggestions (credit still to @timout), as I'm not sure if they're still working on it.  (my fault too as there was such a long delay for a proper review.)
    
    While I had some open questions, I think its a clear improvement in any case.  I just need to get a little help on mesos testing, we can ask on the dev list.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito updated to your wording. Where do you see headers duplicated? 
    I'm going to test custom build that contains this patch(actually on top of 2.2.0) to verify that basic blacklisting works for executor level blacklisting and that offers are rejected up to timeout duration.
    So I'll update you about my findings/testing results


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87558/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r179012299
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -648,14 +645,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
               totalGpusAcquired -= gpus
               gpusByTaskId -= taskId
             }
    -        // If it was a failure, mark the slave as failed for blacklisting purposes
             if (TaskState.isFailed(state)) {
    -          slave.taskFailures += 1
    -
    -          if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
    -            logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
    -                "is Spark installed on it?")
    -          }
    +          logError(s"Task $taskId failed on Mesos slave $slaveId.")
    --- End diff --
    
    minor: I think it would be nice to say "Mesos task $taskId...".  Maybe its obvious for those spending more time with mesos, but I find I'm easily confused by the difference between a mesos task and a spark task.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r195731123
  
    --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala ---
    @@ -108,6 +108,28 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
         verifyTaskLaunched(driver, "o2")
       }
     
    +  test("mesos declines offers from blacklisted slave") {
    +    setBackend()
    +
    +    // launches a task on a valid offer on slave s1
    +    val minMem = backend.executorMemory(sc) + 1024
    +    val minCpu = 4
    +    val offer1 = Resources(minMem, minCpu)
    +    offerResources(List(offer1))
    +    verifyTaskLaunched(driver, "o1")
    +
    +    // for any reason executor(aka mesos task) failed on s1
    +    val status = createTaskStatus("0", "s1", TaskState.TASK_FAILED)
    +    backend.statusUpdate(driver, status)
    +    when(taskScheduler.nodeBlacklist()).thenReturn(Set("hosts1"))
    --- End diff --
    
    @squito reading the code here: https://github.com/apache/spark/pull/21068/files
    is there an option to update the info about blacklisted node when there is a mesos task failure.
    It is a bit inconvenient to lose such events and wait for spark tasks to fail which may never launch since you dont have any executors running anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    thanks @IgorBerman, description looks fine to me now, maybe I saw it wrong before.
    
    your test sounds pretty good to me ... you could turn on debug logging for MesosCoarseGrainedSchedulerBackend and look for these log msgs:
    
    https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L603
    
    What do you mean "it didn't have much effect" -- sounds like it did exactly the right thing?
    
    Sorry, I don't really understand description of the other bug you mentioned.  Why shouldn't it start a 2nd executor on the same slave for the same application?  That seems fine until you have enough failures for the node blacklisting to take effect.  There is also a small race (that is relatively benign) that would allow you to get a an executor on a node which you are in the middle of blacklisting.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    great, any other comment before merging?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87580/testReport)** for PR 20640 at commit [`e2ddc1b`](https://github.com/apache/spark/commit/e2ddc1be19e2f978df4fe84073aff3f5b46afe45).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito In general I agree with you, however current status of hardcoded max 2 failures is blocking, so I've created simple fix that will demand less integration testing. I can close it.
    I think in context of MesosCoarseGrainedSchedulerBackend the TaskSchedulerImpl called scheduler so to test if slaveId is blacklisted probably following is needed:
    {code}
    !scheduler.nodeBlacklist().contains(slaveId)
    {code}


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Hi Everyone, just wanted to let you know that SPARK-16630 is progressing now https://github.com/apache/spark/pull/21068 and after some discussion, the implementation will actually need to live inside the cluster manager code for yarn.  But Mesos should be able to do something very similar, in fact I suspect you could refactor a lot of that code so that its used by mesos as well -- in principle the allocation blacklist just needs to know when container allocation fails, and optionally the node count of the cluster.  I don't totally understand where things live in mesos and if the ApplicationMaster / Driver distinction is present there as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #91639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91639/testReport)** for PR 20640 at commit [`a7ff8cc`](https://github.com/apache/spark/commit/a7ff8cccd1b7e5564880c40c503c169c6bed46b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito, yes failed to start mesos executors caused nodes(slaves) to be blacklisted


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by susanxhuynh <gi...@git.apache.org>.
Github user susanxhuynh commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    I'm in favor of merging this. The hardcoded limit is pretty bad - particularly for streaming jobs; it would be preferable to remove it ASAP even though it may not be a complete solution.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito ok, I've cherry-picked @ 2 commits, so the history will contain his contribution and added on top of it commit with your proposition, hope now it's ok


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    thanks for updating.  can you also update the PR description?
    
    yeah its fine to just update this one.  You can't in general update others' prs, unless they give you push permissions to their repos.  You *can* start from their branch, and then add your changes on top -- that is a little preferable as that way the commit history includes their work, so when someone merges its a bit more obvious.  If you can adjust this PR that way, that would be nice -- but otherwise its OK, I will try to remember to adjust attribution when merging


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito so I've run internal workload with blacklisting on, with killing executors for the framework manually. It didn't have much effect(only once got blacklisting 1 executor and then after timeout it removed it)
    If you have some idea how to test it properly I can try to do it specifically aiming to this scenario of rejecting offers from mesos master for blacklisted slave resource.
    
    ps: not strictly connected to this conversation, but as a side note: I've probably found bug that running in coarse grained mode on mesos + dyn.allocation on + blacklisting on(not sure if it's relevant) + client mode and killing manually executors on some slaves makes spark driver to start 2nd executor on same slave for the same application(which is against constraints of coarse grained mode?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87564/testReport)** for PR 20640 at commit [`2f1db89`](https://github.com/apache/spark/commit/2f1db89de162a05a1cee4c1221de52eacbcbdb68).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    ok hmm ... so actually this change would lose some important functionality then.  unfortunately I don't have a clear picture yet of how to solve SPARK-16630 along with the other blacklisting.
    
    sorry this might actually need a more general solution, need to think about it a bit more ...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Jenkins, ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito, @susanxhuynh  no worries, I have my internal patch.
    
    As a side note, I'm starting to think that my first proposition of introducing new config that will just override this hardcoded constant(2) will be most pragmatic approach in this situation. It's not clear to me when SPARK-16630 will be resolved(open from 1.6). On the other hand, the price would be to introduce new config(maybe without documenting it) for people that need fix which will be deprecated as soon as SPARK-16630 + this change-set will provide as better systematic solution.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87562/testReport)** for PR 20640 at commit [`918e066`](https://github.com/apache/spark/commit/918e066cbb661110b6779c558a5377147a5b1d1e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87562/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito @IgorBerman a few points:
    a) This should be configurable. SPARK-16630 is not closed, it seems also kubernetes needs to support checking for blacklisted nodes when launching tasks. From what I see [here ](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L172) though, the CoarseGrainedSchedulerBackend checks if the executor is acceptable when it registers back to the backend. So if it is launched on a black listed node it is removed immediately. Of course we want to fail fast and thus, in kubernetes case we should also have code to exclude the blacklisted nodes when launching pods. See [here](https://github.com/kubernetes/kubernetes/issues/14573) and [here](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/). This is essentially the same concept like in [yarn](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/yarn/src/main/
 scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L127-L133).
     
    b) I see TaskSchedulerImpl a higher level abstraction compared CoarseGrainedSchedulerBackend (this is subclassed by all major backends).  The thing is blacklist info is kept in TaskSchedulerImpl and all backends update it implicitly in different paths.  CoarseGrainedSchedulerBackend updates it when there is a new status update for [example](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L122). At the end of this path [handleFailedTask](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L772) is called which updates the related info.
    A more interesting path for the discussion is when an executor disconnects and this is where [Yarn](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L213) and [Kubernetes](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L428) override the driver's endpoint to set [criteria](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L356-L366) for when a executor is considered lost and task failure should be taken into consideration for blacklisting. The reason you want this is preemption, or for mesos ,besides that, could be a scenario where we have an agent restarting ([process is restarte
 d](http://mesos.apache.org/documentation/latest/agent-recovery/)) and then executors are killed (framework checkpointing not enabled, eg. when upgrading). In the latter case if I restart my agent process I dont want to blacklist it, my slave is ready to go as soon as it reconnects to master. I also dont want to wait for blacklisting to expire. 
    
    To move forward, I suggest:
    a) remove the handling logic for blacklisting in mesos. Logging is done when a node is blacklisted [here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L194).  
    b) override logic at driver's point when there is a [need](http://mesos.apache.org/documentation/latest/oversubscription/) for doing so. 



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #88628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88628/testReport)** for PR 20640 at commit [`a7ff8cc`](https://github.com/apache/spark/commit/a7ff8cccd1b7e5564880c40c503c169c6bed46b9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87558/testReport)** for PR 20640 at commit [`2240259`](https://github.com/apache/spark/commit/22402594645b3ee14106da61cb401d6555ba2e1b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @squito ok, I can do this small fix to the PR of @timout. How can I update his PR/his code? Or I can use this one? WDYT?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91911/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #87580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87580/testReport)** for PR 20640 at commit [`e2ddc1b`](https://github.com/apache/spark/commit/e2ddc1be19e2f978df4fe84073aff3f5b46afe45).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    **[Test build #91639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91639/testReport)** for PR 20640 at commit [`a7ff8cc`](https://github.com/apache/spark/commit/a7ff8cccd1b7e5564880c40c503c169c6bed46b9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by IgorBerman <gi...@git.apache.org>.
Github user IgorBerman commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    @skonto added logging when tasks failed


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    lgtm
    
    @IgorBerman can you cleanup the PR description a little?  headers got duplicated.  And I'd reword a bit to something like
    
    > This updates the Mesos scheduler to integrate with the common logic for node-blacklisting across all cluster managers in BlacklistTracker.  Specifically, it removes a hardcoded MAX_SLAVE_FAILURES = 2 in MesosCoarseGrainedSchedulerBackend, and uses the blacklist from the BlacklistTracker, as Yarn does.
    
    > This closes https://github.com/apache/spark/pull/17619
    
    (the last "this closes" bit is useful for some tools we have, it will close the other one when this is merged.)
    
    thanks for doing this.  I will probably leave this open for a bit if more mesos users have thoughts


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20640
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org