Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/06/12 21:31:02 UTC

[jira] [Updated] (MESOS-1474) Provide cluster maintenance primitives for operators.

     [ https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1474:
-----------------------------------

    Description: 
Normally cluster upgrades can be done seamlessly using the built-in slave recovery feature. However, there are situations where operators want to be able to perform destructive maintenance operations on machines:

* Non-recoverable slave upgrades.
* Machine reboots.
* Kernel upgrades.
* etc.

In these situations, best practice is to perform rolling maintenance in large batches of machines. This can be problematic for frameworks when many related tasks are located within a batch of machines going down for maintenance.

There are a few primitives of interest here:

* Provide a way for operators to fully shut down a slave (killing all tasks underneath it).
* Provide a way for operators to mark specific slaves as undergoing maintenance. This means that no more offers are sent for these slaves, and no new tasks will launch on them.
* Provide a way for frameworks to be notified when resources are requested to be relinquished. This gives the framework the opportunity to proactively move a task before it is forcibly killed. It also allows the automation of operations like: "please drain these slaves within 1 hour."

  was:
Normally cluster upgrades can be done seamlessly using the built-in slave recovery feature. However, there are situations where operators want to be able to perform destructive maintenance operations on machines:

* Non-recoverable slave upgrades.
* Machine reboots.
* Kernel upgrades.
* etc.

In these situations, best practice is to perform rolling maintenance in large batches of machines. This can be problematic for frameworks when many related tasks are located within a batch of machines going down for maintenance.

There are a few primitives of interest here:

* Provide a way for operators to fully shut down a slave (killing all tasks underneath it).
* Provide a way for operators to mark specific slaves as undergoing maintenance. This means that no more offers are sent for these slaves, and no new tasks will launch on them.
* Provide a way for frameworks to be notified when resources are requested to be relinquished. This gives the framework the opportunity to proactively move a task before it is forcibly killed. It also allows the automation of operations like: "please drain and shut down these slaves within 1 hour."


> Provide cluster maintenance primitives for operators.
> -----------------------------------------------------
>
>                 Key: MESOS-1474
>                 URL: https://issues.apache.org/jira/browse/MESOS-1474
>             Project: Mesos
>          Issue Type: Epic
>          Components: framework, master, slave
>            Reporter: Benjamin Mahler
>
> Normally cluster upgrades can be done seamlessly using the built-in slave recovery feature. However, there are situations where operators want to be able to perform destructive maintenance operations on machines:
> * Non-recoverable slave upgrades.
> * Machine reboots.
> * Kernel upgrades.
> * etc.
> In these situations, best practice is to perform rolling maintenance in large batches of machines. This can be problematic for frameworks when many related tasks are located within a batch of machines going down for maintenance.
> There are a few primitives of interest here:
> * Provide a way for operators to fully shut down a slave (killing all tasks underneath it).
> * Provide a way for operators to mark specific slaves as undergoing maintenance. This means that no more offers are sent for these slaves, and no new tasks will launch on them.
> * Provide a way for frameworks to be notified when resources are requested to be relinquished. This gives the framework the opportunity to proactively move a task before it is forcibly killed. It also allows the automation of operations like: "please drain these slaves within 1 hour."
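
To make the three primitives in the description concrete, here is a minimal in-memory sketch in Python. It is not Mesos code and does not reflect any existing Mesos API: the `Cluster` and `Slave` classes and every method name (`drain`, `shutdown`, `enforce_deadlines`, `offerable_slaves`, the `listeners` callbacks) are hypothetical, chosen only to illustrate how "mark as undergoing maintenance", "notify frameworks", and "fully shut down" might relate to one another.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Slave:
    slave_id: str
    tasks: Set[str] = field(default_factory=set)
    # Set once an operator marks the slave as undergoing maintenance.
    drain_deadline: Optional[datetime] = None

    @property
    def in_maintenance(self) -> bool:
        return self.drain_deadline is not None


class Cluster:
    """Hypothetical stand-in for a master; none of these names are Mesos APIs."""

    def __init__(self) -> None:
        self.slaves: Dict[str, Slave] = {}
        # Framework callbacks: (slave_id, tasks_on_slave, deadline) -> None
        self.listeners: List[Callable[[str, Set[str], datetime], None]] = []

    def add_slave(self, slave_id: str) -> None:
        self.slaves[slave_id] = Slave(slave_id)

    def offerable_slaves(self) -> List[str]:
        # No offers are sent for slaves that are undergoing maintenance.
        return sorted(s.slave_id for s in self.slaves.values()
                      if not s.in_maintenance)

    def launch(self, slave_id: str, task_id: str) -> bool:
        slave = self.slaves[slave_id]
        if slave.in_maintenance:
            return False  # no new tasks launch on a draining slave
        slave.tasks.add(task_id)
        return True

    def kill(self, slave_id: str, task_id: str) -> None:
        self.slaves[slave_id].tasks.discard(task_id)

    def drain(self, slave_ids: List[str], now: datetime,
              within: timedelta) -> datetime:
        """Mark slaves for maintenance and notify frameworks of the deadline."""
        deadline = now + within
        for slave_id in slave_ids:
            slave = self.slaves[slave_id]
            slave.drain_deadline = deadline
            for notify in self.listeners:
                # Pass a copy so callbacks may move tasks while iterating.
                notify(slave_id, set(slave.tasks), deadline)
        return deadline

    def enforce_deadlines(self, now: datetime) -> Dict[str, Set[str]]:
        """Forcibly kill tasks still running on slaves whose deadline passed."""
        killed: Dict[str, Set[str]] = {}
        for slave in self.slaves.values():
            if slave.in_maintenance and slave.drain_deadline <= now:
                killed[slave.slave_id] = set(slave.tasks)
                slave.tasks.clear()
        return killed

    def shutdown(self, slave_id: str) -> Set[str]:
        """Fully shut down a slave, killing all tasks underneath it."""
        return set(self.slaves.pop(slave_id).tasks)
```

In this sketch, "please drain these slaves within 1 hour" would correspond to `cluster.drain(ids, now, timedelta(hours=1))`: a framework callback registered in `listeners` can relaunch its tasks elsewhere before the deadline, and a later `enforce_deadlines` call kills whatever is left. Whether a real design would push notifications through callbacks, offers, or some other channel is left open by the description.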



--
This message was sent by Atlassian JIRA
(v6.2#6252)