Posted to reviews@spark.apache.org by mgummelt <gi...@git.apache.org> on 2017/02/23 00:40:54 UTC

[GitHub] spark pull request #17031: [SPARK-19702] Add suppress/revive support to the ...

GitHub user mgummelt opened a pull request:

    https://github.com/apache/spark/pull/17031

    [SPARK-19702] Add suppress/revive support to the Mesos Spark Dispatcher

    ## What changes were proposed in this pull request?
    
    Adds suppress/revive support to the Mesos Spark Dispatcher to prevent starving other frameworks.  See JIRA for details.  The majority of the lines changed in this PR are superficial refactoring to fix up the `MesosClusterScheduler` class, which was rife with poor naming and code organization.  The meat of the changes is pointed out in the comments.
    
    The Dispatcher should be suppressed when there are no drivers queued nor pending retry.  Whenever the queues defining these two sets are modified, we may potentially need to call `suppressOffers()` or `reviveOffers()`.  We only do so if we aren't already suppressed or revived, respectively.  Strictly speaking, we can never know whether we are suppressed or revived, because the remote driver calls don't guarantee delivery.  In the low-probability event that a revive call fails, the scheduler may think it's revived when really it's suppressed.  This could result in starvation.  The operator would have to manually restart the dispatcher, at which time the dispatcher would again call `reviveOffers()`.  The only general fix is to implement a periodic timer that calls `reviveOffers()` if there are queued/pending drivers to be scheduled.  This can be chatty and complicates the code, so I haven't implemented it here.
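
    A minimal sketch of that check, assuming a `SchedulerDriver` field named `driver` and a boolean `suppressed` flag (both names are illustrative; the PR's actual fields may differ). `shouldSuppress` mirrors the helper visible in the diff discussion below:

    ```scala
    // Sketch only: `driver` and `suppressed` are assumed names.
    private var suppressed = false

    private def shouldSuppress: Boolean =
      queuedDrivers.isEmpty && pendingRetryDrivers.isEmpty

    // Called whenever queuedDrivers or pendingRetryDrivers changes.
    private def suppressOrRevive(): Unit = {
      if (shouldSuppress && !suppressed) {
        driver.suppressOffers()  // stop the offer stream while idle
        suppressed = true
      } else if (!shouldSuppress && suppressed) {
        driver.reviveOffers()    // work is queued again; ask for offers
        suppressed = false
      }
    }
    ```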
    
    ## How was this patch tested?
    
    Unit tests, Manual testing, and Mesos/Spark integration test suite
    
    cc @susanxhuynh @skonto @jmlvanre


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mesosphere/spark SPARK-19702-suppress-revive

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17031.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17031
    
----
commit 98604f833e055cdac13506d9122caad8a5ff0a89
Author: Michael Gummelt <mg...@mesosphere.io>
Date:   2017-02-22T23:30:39Z

    Add suppress/revive support to the Mesos Spark Dispatcher

commit a16a4297131f1d4529569509b597e3178ad60d93
Author: Michael Gummelt <mg...@mesosphere.io>
Date:   2017-02-23T00:29:49Z

    add tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #74020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74020/testReport)** for PR 17031 at commit [`b5fb61e`](https://github.com/apache/spark/commit/b5fb61e67efc54f3bf586f036fdc4bc1b1f4fa4e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @srowen There are parts that are for refactoring purposes only.




[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103287098
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosFineGrainedSchedulerBackend.scala ---
    @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._
     import scala.collection.mutable.{HashMap, HashSet}
    --- End diff --
    
    ok cool!




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73531/testReport)** for PR 17031 at commit [`b6e3205`](https://github.com/apache/spark/commit/b6e32059021ce03eb35d45c28a326c9c477a92e2).




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #17031: [SPARK-19702] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r102612931
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -737,13 +735,75 @@ private[spark] class MesosClusterScheduler(
         if (index != -1) {
           pendingRetryDrivers.remove(index)
           pendingRetryDriversState.expunge(id)
    +      suppressOrRevive()
           true
         } else {
           false
         }
       }
     
    -  def getQueuedDriversSize: Int = queuedDrivers.size
    -  def getLaunchedDriversSize: Int = launchedDrivers.size
    -  def getPendingRetryDriversSize: Int = pendingRetryDrivers.size
    +  private def copyBuffer(buffer: ArrayBuffer[MesosDriverDescription]):
    +      ArrayBuffer[MesosDriverDescription] = {
    +    val newBuffer = new ArrayBuffer[MesosDriverDescription](buffer.size)
    +    buffer.copyToBuffer(newBuffer)
    +    newBuffer
    +  }
    +
    +  /**
    +   * Check if the task state is a recoverable state that we can relaunch the task.
    +   * Task state like TASK_ERROR are not relaunchable state since it wasn't able
    +   * to be validated by Mesos.
    +   */
    +  private def isFailure(state: MesosTaskState): Boolean = {
    +    state == MesosTaskState.TASK_FAILED ||
    +      state == MesosTaskState.TASK_LOST
    +  }
    +
    +  private def shouldSuppress: Boolean = {
    +    return queuedDrivers.isEmpty && pendingRetryDrivers.isEmpty
    +  }
    +
    +  private def suppressOrRevive(): Unit = {
    --- End diff --
    
    This is the meat of the functionality change.  We call this whenever the state of `queuedDrivers` or `pendingRetryDrivers` has changed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73307/testReport)** for PR 17031 at commit [`42636b9`](https://github.com/apache/spark/commit/42636b992155cedccef3d7cdad0ccccf2080347d).




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @srowen Yes, most of the change is refactoring I did along the way while solving this.  If that's going to delay the merge, please let me know and I can remove the refactoring.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    "The only way to fix this generally is to implement some periodic timer that calls reviveOffers() if there are queued/pending drivers to be scheduled. This can be chatty and complicates the code, so I haven't implemented it here."
    Shouldn't we instead check whether we have actually received any offers from the master recently, and call reviveOffers() if not? We could use a backoff approach here...
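
    A rough sketch of that backoff idea (not implemented in this PR; every name here is hypothetical, with `shouldSuppress` and `driver` borrowed from the diff):

    ```scala
    import java.util.concurrent.{Executors, TimeUnit}

    // Revive only if no offers have arrived recently, doubling the wait
    // after each idle revive. resourceOffers() would refresh lastOfferTimeMs.
    private val reviveTimer = Executors.newSingleThreadScheduledExecutor()
    @volatile private var lastOfferTimeMs = System.currentTimeMillis()
    private var reviveWaitSec = 1L
    private val maxReviveWaitSec = 64L

    private def scheduleReviveCheck(): Unit = {
      reviveTimer.schedule(new Runnable {
        override def run(): Unit = {
          val idleMs = System.currentTimeMillis() - lastOfferTimeMs
          if (!shouldSuppress && idleMs >= reviveWaitSec * 1000) {
            driver.reviveOffers()
            reviveWaitSec = math.min(reviveWaitSec * 2, maxReviveWaitSec)
          }
          scheduleReviveCheck()  // re-arm the timer
        }
      }, reviveWaitSec, TimeUnit.SECONDS)
    }
    ```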




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Thanks!




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Given the concerns about the dispatcher being stuck in a suppressed state, I'm going to solve this a different way.  I'm going to increase the default offer decline timeout to 120s and make it configurable, just as it is in the driver.  This means a declined offer can be offered to up to 120 other frameworks before circling back to the dispatcher, rather than the default 5.  I'll also keep the explicit revive calls when a new driver is submitted or an existing one fails, which immediately cause offers to be re-offered to the dispatcher.
    
    This removes the risk that the dispatcher gets stuck in a suppressed state, because it never suppresses itself.




[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103261951
  
    --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSuite.scala ---
    @@ -48,45 +48,50 @@ class MesosClusterSchedulerSuite extends SparkFunSuite with LocalSparkContext wi
         }
     
         driver = mock[SchedulerDriver]
    -    scheduler = new MesosClusterScheduler(
    -      new BlackHoleMesosClusterPersistenceEngineFactory, conf) {
    -      override def start(): Unit = { ready = true }
    +    val persistenceFactory = new BlackHoleMesosClusterPersistenceEngineFactory()
    +    scheduler = new MesosClusterScheduler(persistenceFactory, conf) {
    +      override def start(): Unit = { this.ready = true }
         }
         scheduler.start()
    +    scheduler.registered(driver, Utils.TEST_FRAMEWORK_ID, Utils.TEST_MASTER_INFO)
    +
    +    verify(driver, times(1)).suppressOffers()
    --- End diff --
    
    Here this call has no effect, right? No suppress happens: we add the driver and then this call is made at the end, so the queue is not empty, correct?




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73303/testReport)** for PR 17031 at commit [`a16a429`](https://github.com/apache/spark/commit/a16a4297131f1d4529569509b597e3178ad60d93).




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @srowen To support increasing the default, I've had to:
    - make refuse_seconds configurable
    - factor out `declineOffer` so the dispatcher can use it in addition to the coarse-grained scheduler (sketch below)
    - persist the `schedulerDriver` in both the dispatcher scheduler and the coarse-grained scheduler, so we can access it in callbacks that aren't passed the driver object
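
    A hedged sketch of the factored-out decline helper (parameter names are illustrative, not necessarily the PR's), using the Mesos `SchedulerDriver.declineOffer(OfferID, Filters)` API:

    ```scala
    import org.apache.mesos.Protos.{Filters, OfferID}
    import org.apache.mesos.SchedulerDriver

    // Decline an offer, optionally refusing re-offers for `refuseSeconds`.
    def declineOffer(
        driver: SchedulerDriver,
        offerId: OfferID,
        refuseSeconds: Option[Double]): Unit = {
      refuseSeconds match {
        case Some(seconds) =>
          val filters = Filters.newBuilder().setRefuseSeconds(seconds).build()
          driver.declineOffer(offerId, filters)  // e.g. 120.0 by default
        case None =>
          driver.declineOffer(offerId)           // Mesos default (5 seconds)
      }
    }
    ```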
    





[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @skonto Cassandra supports suppress/revive https://github.com/mesosphere/dcos-cassandra-service/blob/master/cassandra-scheduler/src/main/java/com/mesosphere/dcos/cassandra/scheduler/CassandraScheduler.java#L423
    
    I can't speak for *all* the frameworks in Universe, but Cassandra and Kafka both support suppress/revive, and everything built with the `DefaultScheduler` in dcos-commons gets it for free: https://github.com/mesosphere/dcos-commons/blob/master/sdk/scheduler/src/main/java/com/mesosphere/sdk/scheduler/DefaultScheduler.java#L838





[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Ok I see. LGTM. 




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Your understanding is correct.  You must set refuse_seconds for all your frameworks to some value N, such that N >= #frameworks.  So for this change, if some operator is running >120 frameworks, they may need to configure this value.  However, I'm not aware of any Mesos cluster on Earth running that many frameworks.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74020/
    Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73883/
    Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @srowen Just to move things along, I removed everything not directly relevant to this PR.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73531/
    Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #74020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74020/testReport)** for PR 17031 at commit [`b5fb61e`](https://github.com/apache/spark/commit/b5fb61e67efc54f3bf586f036fdc4bc1b1f4fa4e).




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73307/
    Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Ok, you mean like the Cassandra case, right?




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @skonto It should be clear in the logs.  As long as you have at least INFO logging enabled, you'll see "Suppressing offers." in the logs, and little or nothing after that, since the offer cycles stop.  Unfortunately, Mesos doesn't expose the suppressed state of frameworks, so you can't glean this from state.json.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    It depends on the application.  It's the amount of time you have to wait before having the opportunity to use those resources again.  But if you explicitly revive, which we do here whenever we need more resources, then it doesn't matter.  We could set it to infinity and still never be starved, because we'll always get another shot at the resources when we revive.
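
    To illustrate that pattern (names are hypothetical; the actual submission path in the PR may differ), the explicit revive on submission is roughly:

    ```scala
    // Revive on each new submission or retry, so even an infinite refuse
    // timeout cannot starve the dispatcher: Mesos re-offers immediately.
    def queueDriver(desc: MesosDriverDescription): Unit =
      stateLock.synchronized {
        queuedDrivers += desc
        schedulerDriver.reviveOffers()
      }
    ```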




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73883/testReport)** for PR 17031 at commit [`ba864d0`](https://github.com/apache/spark/commit/ba864d0f86700bb51bebe53ead95854a57a02361).




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73883/testReport)** for PR 17031 at commit [`ba864d0`](https://github.com/apache/spark/commit/ba864d0f86700bb51bebe53ead95854a57a02361).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @skonto updated the description.  




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @mgummelt LGTM. Thanks for the clarifications. @srowen can we get a merge?




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by susanxhuynh <gi...@git.apache.org>.
Github user susanxhuynh commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    If we're concerned about the lost reviveOffer() and don't want to handle that corner case, do we want to document it somewhere for operators? "If jobs aren't running and you see [...] in the logs, do this".




[GitHub] spark pull request #17031: [SPARK-19702] Add suppress/revive support to the ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r102707406
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosFineGrainedSchedulerBackend.scala ---
    @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._
     import scala.collection.mutable.{HashMap, HashSet}
    --- End diff --
    
    Do we still support fine-grained mode?




[GitHub] spark pull request #17031: [SPARK-19702] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r102774738
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosFineGrainedSchedulerBackend.scala ---
    @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._
     import scala.collection.mutable.{HashMap, HashSet}
    --- End diff --
    
    It's deprecated, but I had to make some changes to it just to compile.  I hope to remove it completely by 2.2.




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73303/testReport)** for PR 17031 at commit [`a16a429`](https://github.com/apache/spark/commit/a16a4297131f1d4529569509b597e3178ad60d93).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Compared to the title, this still looks like a significant change. Is the intent something different from the JIRA? This doesn't just increase a default. I don't have any opinion on the changes; I'm just commenting on the consistency of the change vs. the discussion and paper trail.




[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103281366
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -737,13 +735,75 @@ private[spark] class MesosClusterScheduler(
         if (index != -1) {
           pendingRetryDrivers.remove(index)
           pendingRetryDriversState.expunge(id)
    +      suppressOrRevive()
           true
         } else {
           false
         }
       }
     
    -  def getQueuedDriversSize: Int = queuedDrivers.size
    -  def getLaunchedDriversSize: Int = launchedDrivers.size
    -  def getPendingRetryDriversSize: Int = pendingRetryDrivers.size
    +  private def copyBuffer(buffer: ArrayBuffer[MesosDriverDescription]):
    +      ArrayBuffer[MesosDriverDescription] = {
    +    val newBuffer = new ArrayBuffer[MesosDriverDescription](buffer.size)
    +    buffer.copyToBuffer(newBuffer)
    +    newBuffer
    +  }
    +
    +  /**
    +   * Check if the task state is a recoverable state that we can relaunch the task.
    +   * Task state like TASK_ERROR are not relaunchable state since it wasn't able
    +   * to be validated by Mesos.
    +   */
    +  private def isFailure(state: MesosTaskState): Boolean = {
    +    state == MesosTaskState.TASK_FAILED ||
    +      state == MesosTaskState.TASK_LOST
    +  }
    +
    +  private def shouldSuppress: Boolean = {
    +    return queuedDrivers.isEmpty && pendingRetryDrivers.isEmpty
    --- End diff --
    
    return is redundant.




[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103280133
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/ui/MesosClusterPage.scala ---
    @@ -32,7 +32,7 @@ private[mesos] class MesosClusterPage(parent: MesosClusterUI) extends WebUIPage(
       private val historyServerURL = parent.conf.get(HISTORY_SERVER_URL)
     
       def render(request: HttpServletRequest): Seq[Node] = {
    -    val state = parent.scheduler.getSchedulerState()
    +    val state = parent.scheduler.getSchedulerState
     
         val driverHeader = Seq("Driver ID")
         val historyHeader = historyServerURL.map(url => Seq("History")).getOrElse(Nil)
    --- End diff --
    
    Since you are refactoring the code anyway: s/url/_ (the `url` binding is unused in the body).




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @mgummelt do we want to keep the suppress/revive technique?




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    How should the operator know about starvation?




[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73531/testReport)** for PR 17031 at commit [`b6e3205`](https://github.com/apache/spark/commit/b6e32059021ce03eb35d45c28a326c9c477a92e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @mgummelt Yes, they should look at the logs, but how do they know this is something that requires action on their side, and not a cluster issue or something else? It should be documented, since the recovery is manual.




[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103281854
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -582,141 +688,33 @@ private[spark] class MesosClusterScheduler(
         }
       }
     
    -  override def resourceOffers(driver: SchedulerDriver, offers: JList[Offer]): Unit = {
    -    logTrace(s"Received offers from Mesos: \n${offers.asScala.mkString("\n")}")
    -    val tasks = new mutable.HashMap[OfferID, ArrayBuffer[TaskInfo]]()
    -    val currentTime = new Date()
    -
    -    val currentOffers = offers.asScala.map {
    -      o => new ResourceOffer(o.getId, o.getSlaveId, o.getResourcesList)
    -    }.toList
    -
    -    stateLock.synchronized {
    -      // We first schedule all the supervised drivers that are ready to retry.
    -      // This list will be empty if none of the drivers are marked as supervise.
    -      val driversToRetry = pendingRetryDrivers.filter { d =>
    -        d.retryState.get.nextRetry.before(currentTime)
    -      }
    -
    -      scheduleTasks(
    -        copyBuffer(driversToRetry),
    -        removeFromPendingRetryDrivers,
    -        currentOffers,
    -        tasks)
    -
    -      // Then we walk through the queued drivers and try to schedule them.
    -      scheduleTasks(
    -        copyBuffer(queuedDrivers),
    -        removeFromQueuedDrivers,
    -        currentOffers,
    -        tasks)
    -    }
    -    tasks.foreach { case (offerId, taskInfos) =>
    -      driver.launchTasks(Collections.singleton(offerId), taskInfos.asJava)
    -    }
    -
    -    for (o <- currentOffers if !tasks.contains(o.offerId)) {
    -      driver.declineOffer(o.offerId)
    -    }
    -  }
    -
    -  private def copyBuffer(
    -      buffer: ArrayBuffer[MesosDriverDescription]): ArrayBuffer[MesosDriverDescription] = {
    -    val newBuffer = new ArrayBuffer[MesosDriverDescription](buffer.size)
    -    buffer.copyToBuffer(newBuffer)
    -    newBuffer
    -  }
    -
    -  def getSchedulerState(): MesosClusterSchedulerState = {
    -    stateLock.synchronized {
    -      new MesosClusterSchedulerState(
    -        frameworkId,
    -        masterInfo.map(m => s"http://${m.getIp}:${m.getPort}"),
    -        copyBuffer(queuedDrivers),
    -        launchedDrivers.values.map(_.copy()).toList,
    -        finishedDrivers.map(_.copy()).toList,
    -        copyBuffer(pendingRetryDrivers))
    -    }
    -  }
    -
    -  override def offerRescinded(driver: SchedulerDriver, offerId: OfferID): Unit = {}
    -  override def disconnected(driver: SchedulerDriver): Unit = {}
    -  override def reregistered(driver: SchedulerDriver, masterInfo: MasterInfo): Unit = {
    -    logInfo(s"Framework re-registered with master ${masterInfo.getId}")
    -  }
    -  override def slaveLost(driver: SchedulerDriver, slaveId: SlaveID): Unit = {}
    -  override def error(driver: SchedulerDriver, error: String): Unit = {
    -    logError("Error received: " + error)
    -    markErr()
    -  }
    +  private def createTaskInfo(desc: MesosDriverDescription, offer: ResourceOffer): TaskInfo = {
    +    val taskId = TaskID.newBuilder().setValue(desc.submissionId).build()
     
    -  /**
    -   * Check if the task state is a recoverable state that we can relaunch the task.
    -   * Task state like TASK_ERROR are not relaunchable state since it wasn't able
    -   * to be validated by Mesos.
    -   */
    -  private def shouldRelaunch(state: MesosTaskState): Boolean = {
    -    state == MesosTaskState.TASK_FAILED ||
    -      state == MesosTaskState.TASK_LOST
    -  }
    +    val (remainingResources, cpuResourcesToUse) =
    +      partitionResources(offer.resources, "cpus", desc.cores)
    +    val (finalResources, memResourcesToUse) =
    +      partitionResources(remainingResources.asJava, "mem", desc.mem)
    +    offer.resources = finalResources.asJava
     
    -  override def statusUpdate(driver: SchedulerDriver, status: TaskStatus): Unit = {
    -    val taskId = status.getTaskId.getValue
    -    stateLock.synchronized {
    -      if (launchedDrivers.contains(taskId)) {
    -        if (status.getReason == Reason.REASON_RECONCILIATION &&
    -          !pendingRecover.contains(taskId)) {
    -          // Task has already received update and no longer requires reconciliation.
    -          return
    -        }
    -        val state = launchedDrivers(taskId)
    -        // Check if the driver is supervise enabled and can be relaunched.
    -        if (state.driverDescription.supervise && shouldRelaunch(status.getState)) {
    -          removeFromLaunchedDrivers(taskId)
    -          state.finishDate = Some(new Date())
    -          val retryState: Option[MesosClusterRetryState] = state.driverDescription.retryState
    -          val (retries, waitTimeSec) = retryState
    -            .map { rs => (rs.retries + 1, Math.min(maxRetryWaitTime, rs.waitTime * 2)) }
    -            .getOrElse{ (1, 1) }
    -          val nextRetry = new Date(new Date().getTime + waitTimeSec * 1000L)
    -
    -          val newDriverDescription = state.driverDescription.copy(
    -            retryState = Some(new MesosClusterRetryState(status, retries, nextRetry, waitTimeSec)))
    -          pendingRetryDrivers += newDriverDescription
    -          pendingRetryDriversState.persist(taskId, newDriverDescription)
    -        } else if (TaskState.isFinished(mesosToTaskState(status.getState))) {
    -          removeFromLaunchedDrivers(taskId)
    -          state.finishDate = Some(new Date())
    -          if (finishedDrivers.size >= retainedDrivers) {
    -            val toRemove = math.max(retainedDrivers / 10, 1)
    -            finishedDrivers.trimStart(toRemove)
    -          }
    -          finishedDrivers += state
    -        }
    -        state.mesosTaskStatus = Option(status)
    -      } else {
    -        logError(s"Unable to find driver $taskId in status update")
    -      }
    -    }
    +    val appName = desc.conf.get("spark.app.name")
    +    val taskInfo = TaskInfo.newBuilder()
    +      .setTaskId(taskId)
    +      .setName(s"Driver for ${appName}")
    --- End diff --
    
    The brackets are redundant here; plain $appName interpolates the same.
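
    For context, both forms produce the same string; a minimal Scala sketch (values are illustrative):

        val appName = "MyApp"                      // illustrative value
        val withBraces = s"Driver for ${appName}"  // braces are optional around a bare identifier
        val noBraces   = s"Driver for $appName"    // interpolates identically
        assert(withBraces == noBraces)
        // Braces are required only for expressions:
        val upper = s"Driver for ${appName.toUpperCase}"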


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    But this time it is the refuse time, correct? As stated in
    https://issues.apache.org/jira/browse/MESOS-3202, some other framework in the list has 30 seconds to accept the declined resources; otherwise the first framework will be offered them again. So, implicitly, once you factor in the master's allocation delay, this value limits the number of frameworks that will be offered the resources Cassandra declined (assuming Cassandra is first in the list). With many frameworks in that list, at least some will starve, so this should be a large value.
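
    For reference, refuse_seconds is attached per decline via the offer filters; a minimal sketch against the Mesos scheduler driver API (the helper name is illustrative):

        import org.apache.mesos.Protos.{Filters, OfferID}
        import org.apache.mesos.SchedulerDriver

        // Decline an offer and ask the master not to re-offer these resources
        // to this framework for 120 seconds; without a filter, Mesos applies
        // its default of 5 seconds.
        def declineForLong(driver: SchedulerDriver, offerId: OfferID): Unit = {
          val filters = Filters.newBuilder()
            .setRefuseSeconds(120)
            .build()
          driver.declineOffer(offerId, filters)
        }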


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @skonto @susanxhuynh I've updated the solution to use a longer (120s) default refuse timeout instead of suppressing offers.  Please re-review.  Just as the previous refuse_seconds settings were undocumented, I've left this one undocumented.  Users should almost never need to customize it.
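
    For anyone who does need to tune it, a minimal sketch of the override (the key name is an assumption based on this PR, since the setting is intentionally undocumented):

        import org.apache.spark.SparkConf

        // Hypothetical: raise the refuse timeout above the new 120s default.
        val conf = new SparkConf()
          .set("spark.mesos.rejectOfferDuration", "300s")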


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @skonto I completely agree that this is a cluster-wide issue, but unfortunately that's the state of things.  In the long term, optimistic offers in Mesos should fix this.


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @susanxhuynh I don't think it's worth documenting.  It should be clear in the logs, which is where an operator will look if they notice that no jobs are launching.


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @susanxhuynh Mesos/Spark integration tests: https://github.com/typesafehub/mesos-spark-integration-tests.  We run them as a subset of DC/OS Spark integration tests: https://github.com/mesosphere/spark-build/blob/master/tests/test.py#L89
    
    



[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73303/
    Test FAILed.


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @srowen ping


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Add suppress/revive support to the ...

Posted by susanxhuynh <gi...@git.apache.org>.
Github user susanxhuynh commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    The suppress / revive logic LGTM. I didn't look that closely at the refactoring changes. Where are the Mesos/Spark integration tests that you mentioned? @mgummelt 
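
    The suppress/revive pattern itself reduces to two scheduler-driver calls; a minimal sketch, with a hypothetical helper name:

        import org.apache.mesos.SchedulerDriver

        // suppressOffers(): the master stops sending this framework offers.
        // reviveOffers():   clears suppression and any outstanding decline
        //                   filters, so offers flow again.
        def setOffersEnabled(driver: SchedulerDriver, enabled: Boolean): Unit = {
          if (enabled) driver.reviveOffers() else driver.suppressOffers()
        }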


[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103303283
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -737,13 +735,75 @@ private[spark] class MesosClusterScheduler(
         if (index != -1) {
           pendingRetryDrivers.remove(index)
           pendingRetryDriversState.expunge(id)
    +      suppressOrRevive()
           true
         } else {
           false
         }
       }
     
    -  def getQueuedDriversSize: Int = queuedDrivers.size
    -  def getLaunchedDriversSize: Int = launchedDrivers.size
    -  def getPendingRetryDriversSize: Int = pendingRetryDrivers.size
    +  private def copyBuffer(buffer: ArrayBuffer[MesosDriverDescription]):
    +      ArrayBuffer[MesosDriverDescription] = {
    +    val newBuffer = new ArrayBuffer[MesosDriverDescription](buffer.size)
    +    buffer.copyToBuffer(newBuffer)
    +    newBuffer
    +  }
    +
    +  /**
    +   * Check if the task state is a recoverable state that we can relaunch the task.
    +   * Task state like TASK_ERROR are not relaunchable state since it wasn't able
    +   * to be validated by Mesos.
    +   */
    +  private def isFailure(state: MesosTaskState): Boolean = {
    +    state == MesosTaskState.TASK_FAILED ||
    +      state == MesosTaskState.TASK_LOST
    +  }
    +
    +  private def shouldSuppress: Boolean = {
    +    return queuedDrivers.isEmpty && pendingRetryDrivers.isEmpty
    +  }
    +
    +  private def suppressOrRevive(): Unit = {
    +    if (shouldSuppress && !isSuppressed) {
    +      logInfo("Suppressing Offers.")
    +      driver.suppressOffers()
    +      isSuppressed = true
    +    } else if (!shouldSuppress && isSuppressed) {
    +      logInfo("Reviving Offers.")
    +      driver.reviveOffers()
    +      isSuppressed = false
    +    }
    +  }
    +
    +  /**
    +   * Escape args for Unix-like shells, unless already quoted by the user.
    +   * Based on: http://www.gnu.org/software/bash/manual/html_node/Double-Quotes.html
    +   * and http://www.grymoire.com/Unix/Quote.html
    +   *
    +   * @param value argument
    +   * @return escaped argument
    +   */
    +  private[scheduler] def shellEscape(value: String): String = {
    +    val WrappedInQuotes = """^(".+"|'.+')$""".r
    +    val ShellSpecialChars = (""".*([ '<>&|\?\*;!#\\(\)"$`]).*""").r
    --- End diff --
    
    see above


[GitHub] spark issue #17031: [SPARK-19702][MESOS] Increase default refuse_seconds tim...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    @skonto Any other concerns?  Can I get a LGTM?


[GitHub] spark issue #17031: [SPARK-19702] Add suppress/revive support to the Mesos S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17031
  
    **[Test build #73307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73307/testReport)** for PR 17031 at commit [`42636b9`](https://github.com/apache/spark/commit/42636b992155cedccef3d7cdad0ccccf2080347d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by mgummelt <gi...@git.apache.org>.
Github user mgummelt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103303266
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -582,141 +688,33 @@ private[spark] class MesosClusterScheduler(
         }
       }
     
    -  override def resourceOffers(driver: SchedulerDriver, offers: JList[Offer]): Unit = {
    -    logTrace(s"Received offers from Mesos: \n${offers.asScala.mkString("\n")}")
    -    val tasks = new mutable.HashMap[OfferID, ArrayBuffer[TaskInfo]]()
    -    val currentTime = new Date()
    -
    -    val currentOffers = offers.asScala.map {
    -      o => new ResourceOffer(o.getId, o.getSlaveId, o.getResourcesList)
    -    }.toList
    -
    -    stateLock.synchronized {
    -      // We first schedule all the supervised drivers that are ready to retry.
    -      // This list will be empty if none of the drivers are marked as supervise.
    -      val driversToRetry = pendingRetryDrivers.filter { d =>
    -        d.retryState.get.nextRetry.before(currentTime)
    -      }
    -
    -      scheduleTasks(
    -        copyBuffer(driversToRetry),
    -        removeFromPendingRetryDrivers,
    -        currentOffers,
    -        tasks)
    -
    -      // Then we walk through the queued drivers and try to schedule them.
    -      scheduleTasks(
    -        copyBuffer(queuedDrivers),
    -        removeFromQueuedDrivers,
    -        currentOffers,
    -        tasks)
    -    }
    -    tasks.foreach { case (offerId, taskInfos) =>
    -      driver.launchTasks(Collections.singleton(offerId), taskInfos.asJava)
    -    }
    -
    -    for (o <- currentOffers if !tasks.contains(o.offerId)) {
    -      driver.declineOffer(o.offerId)
    -    }
    -  }
    -
    -  private def copyBuffer(
    -      buffer: ArrayBuffer[MesosDriverDescription]): ArrayBuffer[MesosDriverDescription] = {
    -    val newBuffer = new ArrayBuffer[MesosDriverDescription](buffer.size)
    -    buffer.copyToBuffer(newBuffer)
    -    newBuffer
    -  }
    -
    -  def getSchedulerState(): MesosClusterSchedulerState = {
    -    stateLock.synchronized {
    -      new MesosClusterSchedulerState(
    -        frameworkId,
    -        masterInfo.map(m => s"http://${m.getIp}:${m.getPort}"),
    -        copyBuffer(queuedDrivers),
    -        launchedDrivers.values.map(_.copy()).toList,
    -        finishedDrivers.map(_.copy()).toList,
    -        copyBuffer(pendingRetryDrivers))
    -    }
    -  }
    -
    -  override def offerRescinded(driver: SchedulerDriver, offerId: OfferID): Unit = {}
    -  override def disconnected(driver: SchedulerDriver): Unit = {}
    -  override def reregistered(driver: SchedulerDriver, masterInfo: MasterInfo): Unit = {
    -    logInfo(s"Framework re-registered with master ${masterInfo.getId}")
    -  }
    -  override def slaveLost(driver: SchedulerDriver, slaveId: SlaveID): Unit = {}
    -  override def error(driver: SchedulerDriver, error: String): Unit = {
    -    logError("Error received: " + error)
    -    markErr()
    -  }
    +  private def createTaskInfo(desc: MesosDriverDescription, offer: ResourceOffer): TaskInfo = {
    +    val taskId = TaskID.newBuilder().setValue(desc.submissionId).build()
     
    -  /**
    -   * Check if the task state is a recoverable state that we can relaunch the task.
    -   * Task state like TASK_ERROR are not relaunchable state since it wasn't able
    -   * to be validated by Mesos.
    -   */
    -  private def shouldRelaunch(state: MesosTaskState): Boolean = {
    -    state == MesosTaskState.TASK_FAILED ||
    -      state == MesosTaskState.TASK_LOST
    -  }
    +    val (remainingResources, cpuResourcesToUse) =
    +      partitionResources(offer.resources, "cpus", desc.cores)
    +    val (finalResources, memResourcesToUse) =
    +      partitionResources(remainingResources.asJava, "mem", desc.mem)
    +    offer.resources = finalResources.asJava
     
    -  override def statusUpdate(driver: SchedulerDriver, status: TaskStatus): Unit = {
    -    val taskId = status.getTaskId.getValue
    -    stateLock.synchronized {
    -      if (launchedDrivers.contains(taskId)) {
    -        if (status.getReason == Reason.REASON_RECONCILIATION &&
    -          !pendingRecover.contains(taskId)) {
    -          // Task has already received update and no longer requires reconciliation.
    -          return
    -        }
    -        val state = launchedDrivers(taskId)
    -        // Check if the driver is supervise enabled and can be relaunched.
    -        if (state.driverDescription.supervise && shouldRelaunch(status.getState)) {
    -          removeFromLaunchedDrivers(taskId)
    -          state.finishDate = Some(new Date())
    -          val retryState: Option[MesosClusterRetryState] = state.driverDescription.retryState
    -          val (retries, waitTimeSec) = retryState
    -            .map { rs => (rs.retries + 1, Math.min(maxRetryWaitTime, rs.waitTime * 2)) }
    -            .getOrElse{ (1, 1) }
    -          val nextRetry = new Date(new Date().getTime + waitTimeSec * 1000L)
    -
    -          val newDriverDescription = state.driverDescription.copy(
    -            retryState = Some(new MesosClusterRetryState(status, retries, nextRetry, waitTimeSec)))
    -          pendingRetryDrivers += newDriverDescription
    -          pendingRetryDriversState.persist(taskId, newDriverDescription)
    -        } else if (TaskState.isFinished(mesosToTaskState(status.getState))) {
    -          removeFromLaunchedDrivers(taskId)
    -          state.finishDate = Some(new Date())
    -          if (finishedDrivers.size >= retainedDrivers) {
    -            val toRemove = math.max(retainedDrivers / 10, 1)
    -            finishedDrivers.trimStart(toRemove)
    -          }
    -          finishedDrivers += state
    -        }
    -        state.mesosTaskStatus = Option(status)
    -      } else {
    -        logError(s"Unable to find driver $taskId in status update")
    -      }
    -    }
    +    val appName = desc.conf.get("spark.app.name")
    +    val taskInfo = TaskInfo.newBuilder()
    +      .setTaskId(taskId)
    +      .setName(s"Driver for ${appName}")
    --- End diff --
    
    The brackets are consistent with our other format strings.  I'm not trying to refactor all the code in this PR, btw.  I just touched the code whose poor style was hindering my ability to solve the problem related to this PR.  


[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Increase default refuse_seco...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17031


[GitHub] spark pull request #17031: [SPARK-19702][MESOS] Add suppress/revive support ...

Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17031#discussion_r103280797
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -737,13 +735,75 @@ private[spark] class MesosClusterScheduler(
         if (index != -1) {
           pendingRetryDrivers.remove(index)
           pendingRetryDriversState.expunge(id)
    +      suppressOrRevive()
           true
         } else {
           false
         }
       }
     
    -  def getQueuedDriversSize: Int = queuedDrivers.size
    -  def getLaunchedDriversSize: Int = launchedDrivers.size
    -  def getPendingRetryDriversSize: Int = pendingRetryDrivers.size
    +  private def copyBuffer(buffer: ArrayBuffer[MesosDriverDescription]):
    +      ArrayBuffer[MesosDriverDescription] = {
    +    val newBuffer = new ArrayBuffer[MesosDriverDescription](buffer.size)
    +    buffer.copyToBuffer(newBuffer)
    +    newBuffer
    +  }
    +
    +  /**
    +   * Check if the task state is a recoverable state that we can relaunch the task.
    +   * Task state like TASK_ERROR are not relaunchable state since it wasn't able
    +   * to be validated by Mesos.
    +   */
    +  private def isFailure(state: MesosTaskState): Boolean = {
    +    state == MesosTaskState.TASK_FAILED ||
    +      state == MesosTaskState.TASK_LOST
    +  }
    +
    +  private def shouldSuppress: Boolean = {
    +    return queuedDrivers.isEmpty && pendingRetryDrivers.isEmpty
    +  }
    +
    +  private def suppressOrRevive(): Unit = {
    +    if (shouldSuppress && !isSuppressed) {
    +      logInfo("Suppressing Offers.")
    +      driver.suppressOffers()
    +      isSuppressed = true
    +    } else if (!shouldSuppress && isSuppressed) {
    +      logInfo("Reviving Offers.")
    +      driver.reviveOffers()
    +      isSuppressed = false
    +    }
    +  }
    +
    +  /**
    +   * Escape args for Unix-like shells, unless already quoted by the user.
    +   * Based on: http://www.gnu.org/software/bash/manual/html_node/Double-Quotes.html
    +   * and http://www.grymoire.com/Unix/Quote.html
    +   *
    +   * @param value argument
    +   * @return escaped argument
    +   */
    +  private[scheduler] def shellEscape(value: String): String = {
    +    val WrappedInQuotes = """^(".+"|'.+')$""".r
    +    val ShellSpecialChars = (""".*([ '<>&|\?\*;!#\\(\)"$`]).*""").r
    --- End diff --
    
    Parentheses are redundant. 
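
    That is, both forms compile to the same Regex, since .r is just a method on the string:

        // Identical patterns; the outer parentheses add nothing.
        val withParens    = (""".*([ '<>&|\?\*;!#\\(\)"$`]).*""").r
        val withoutParens = """.*([ '<>&|\?\*;!#\\(\)"$`]).*""".r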

