Posted to commits@mesos.apache.org by jo...@apache.org on 2016/04/22 03:07:17 UTC

mesos git commit: Clarified and improved documentation for maintenance primitives.

Repository: mesos
Updated Branches:
  refs/heads/master 90f7645cc -> f72d2e200


Clarified and improved documentation for maintenance primitives.

Review: https://reviews.apache.org/r/46456/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/f72d2e20
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/f72d2e20
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/f72d2e20

Branch: refs/heads/master
Commit: f72d2e20052b47b597e23df9a2fffc95dbd2ad8d
Parents: 90f7645
Author: Neil Conway <ne...@gmail.com>
Authored: Thu Apr 21 18:06:53 2016 -0700
Committer: Joris Van Remoortere <jo...@gmail.com>
Committed: Thu Apr 21 18:07:05 2016 -0700

----------------------------------------------------------------------
 docs/endpoints/master/maintenance/schedule.md |   2 +-
 docs/maintenance.md                           | 142 ++++++++++++---------
 src/master/http.cpp                           |   2 +-
 3 files changed, 84 insertions(+), 62 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/f72d2e20/docs/endpoints/master/maintenance/schedule.md
----------------------------------------------------------------------
diff --git a/docs/endpoints/master/maintenance/schedule.md b/docs/endpoints/master/maintenance/schedule.md
index 577e43c..3d69900 100644
--- a/docs/endpoints/master/maintenance/schedule.md
+++ b/docs/endpoints/master/maintenance/schedule.md
@@ -15,7 +15,7 @@ Returns or updates the cluster's maintenance schedule.
 GET: Returns the current maintenance schedule as JSON.
 
 POST: Validates the request body as JSON
-  and updates the maintenance schedule.
+and updates the maintenance schedule.
 
 
 ### AUTHENTICATION ###
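
The GET and POST behaviour described for this endpoint can be exercised from a short client. The following is a minimal sketch and is not part of the commit. It assumes a master at `localhost:5050` (the address used by the curl examples in docs/maintenance.md) and the `/maintenance/schedule` path. The helper names `get_schedule_request` and `post_schedule_request` are hypothetical, and the `windows` field used for the empty schedule is an assumption about the schedule's JSON layout. The sketch only builds the requests; sending them needs a running master and, where authentication is enabled, credentials that are not shown here.

```python
import json
import urllib.request

# Assumed master address, taken from the curl examples; adjust for a real cluster.
MASTER = "http://localhost:5050"
SCHEDULE_URL = MASTER + "/maintenance/schedule"


def get_schedule_request():
    """Build (but do not send) a GET request for the current schedule."""
    return urllib.request.Request(SCHEDULE_URL, method="GET")


def post_schedule_request(schedule):
    """Build (but do not send) a POST request that replaces the schedule."""
    body = json.dumps(schedule).encode("utf-8")
    return urllib.request.Request(
        SCHEDULE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The documentation says maintenance is cancelled by posting an empty schedule.
# The "windows" key is an assumed field name for the list of windows.
cancel = post_schedule_request({"windows": []})
print(cancel.get_method(), cancel.full_url, cancel.data.decode("utf-8"))
```

To actually send one of these requests, pass it to `urllib.request.urlopen`. Because a POST replaces the whole schedule, an update would normally start from the result of the GET request.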

http://git-wip-us.apache.org/repos/asf/mesos/blob/f72d2e20/docs/maintenance.md
----------------------------------------------------------------------
diff --git a/docs/maintenance.md b/docs/maintenance.md
index b86ccf6..1f2b14f 100644
--- a/docs/maintenance.md
+++ b/docs/maintenance.md
@@ -12,7 +12,7 @@ For example:
 
 * Hardware repair
 * Kernel upgrades
-* Agent upgrades (e.g. adjusting agent attributes or resources)
+* Agent upgrades (e.g., adjusting agent attributes or resources)
 
 Frameworks require visibility into any actions that disrupt cluster operation
 in order to meet Service Level Agreements or to ensure uninterrupted services
@@ -25,29 +25,28 @@ frameworks and operator.
 ## Terminology
 
 For the purpose of this document, an "Operator" is a person, tool, or script
-which manages the Mesos cluster.
+that manages a Mesos cluster.
 
-Maintenance primitives add several new concepts to the Mesos cluster.
-Those concepts are:
+Maintenance primitives add several new concepts to Mesos. Those concepts are:
 
-* **Maintenance** - An operation that makes resources on a machine unavailable,
+* **Maintenance**: An operation that makes resources on a machine unavailable,
   either temporarily or permanently.
-* **Maintenance window** - A set of machines and an associated interval during
+* **Maintenance window**: A set of machines and an associated time interval during
   which some maintenance is planned on those machines.
-* **Maintenance schedule** - A list of maintenance windows.
+* **Maintenance schedule**: A list of maintenance windows.
   A single machine may only appear in a schedule once.
-* **Unavailability** - An operator-specified interval, defined by a start time
+* **Unavailability**: An operator-specified interval, defined by a start time
   and duration, during which an associated machine may become unavailable.
   In general, no assumptions should be made about the availability of the
   machine (or resources) after the unavailability.
-* **Drain** - An interval between the scheduling of maintenance and when the
+* **Drain**: An interval between the scheduling of maintenance and when the
   machine(s) become unavailable.  Offers sent with resources from draining
   machines will contain unavailability information.  Frameworks running on
   draining machines will receive inverse offers (see next).  Frameworks
   utilizing resources on affected machines are expected either to take
   preemptive steps to prepare for the unavailability; or to communicate the
   framework's inability to conform to the maintenance schedule.
-* **Inverse offer** - A communication mechanism for the master to ask for
+* **Inverse offer**: A communication mechanism for the master to ask for
   resources back from a framework.  This notifies frameworks about any
   unavailability and gives frameworks a mechanism to respond about their
   capability to comply.  Inverse offers are similar to offers in that they
@@ -70,20 +69,27 @@ the maintenance schedule.
 
 ### Scheduling maintenance
 
-A machine is transitioned from Up mode into Draining mode as soon as it
-is scheduled for maintenance.  To transition a machine into Draining mode,
-an operator constructs a maintenance schedule and posts it to the Mesos master.
+A machine is transitioned from Up mode into Draining mode as soon as it is
+scheduled for maintenance.  To transition a machine into Draining mode, an
+operator constructs a maintenance schedule as a JSON document and posts it to the
+[/maintenance/schedule](endpoints/master/maintenance/schedule.md) HTTP endpoint on
+the Mesos master. A given Mesos master has a single maintenance schedule;
+posting a new schedule replaces the previous schedule, if any.
 
 See the definition of a [maintenance::Schedule](https://github.com/apache/mesos/blob/016b02d7ed5a65bcad9261a133c8237c2df66e6e/include/mesos/maintenance/maintenance.proto#L48-L67)
 and of [Unavailability](https://github.com/apache/mesos/blob/016b02d7ed5a65bcad9261a133c8237c2df66e6e/include/mesos/v1/mesos.proto#L140-L154).
 
-In a production environment, the schedule must be constructed to ensure that
+In a production environment, the schedule should be constructed to ensure that
 enough agents are operational at any given point in time to ensure
 uninterrupted service by the frameworks.
 
-For example, in a cluster of three machines, the operator can schedule two
+For example, in a cluster of three machines, the operator might schedule two
 machines for one hour of maintenance, followed by another hour for the last
-machine.  The timestamps for unavailability are in nanoseconds since the epoch.
+machine.  The timestamps for unavailability are expressed in nanoseconds since
+the Unix epoch (note that making reliable use of maintenance primitives requires
+that the system clocks of all machines in the cluster are roughly synchronized,
+including any machines hosting frameworks).
+
 The schedule might look like:
 
 ```
@@ -111,19 +117,21 @@ The schedule might look like:
 }
 ```
 
-The operator then posts the schedule to the master's maintenance endpoints.
+The operator can then post the schedule to the master's
+[/maintenance/schedule](endpoints/master/maintenance/schedule.md) endpoint:
 
 ```
-curl http://localhost:5050/master/maintenance/schedule
+curl http://localhost:5050/maintenance/schedule
   -H "Content-type: application/json"
   -X POST
   -d @schedule.json
 ```
 
-The machines in a maintenance schedule do not necessarily need to be registered
-with the Mesos master.  The operator may add a machine to the maintenance
-schedule prior to launching an agent on the machine.  For example, this is
-useful for preventing a faulty machine from launching an agent on boot.
+The machines in a maintenance schedule do not need to be registered with the
+Mesos master at the time when the schedule is set.  The operator may add a
+machine to the maintenance schedule prior to launching an agent on the machine.
+For example, this can be useful to prevent a faulty machine from launching an
+agent on boot.
 
 **Note**: Each machine in the maintenance schedule should have as
 complete information as possible.  In order for Mesos to recognize an agent
@@ -146,12 +154,14 @@ The master checks that a maintenance schedule has the following properties:
   from Down mode to Up mode.
 
 If any of these properties are not met, the maintenance schedule is rejected
-with a corresponding error message and the master's state does not change.
+with a corresponding error message and the master's state is not changed.
 
-To update a maintenance schedule, the operator should first read the existing
-schedule, make the necessary changes, and then post the modified schedule.
+To update the maintenance schedule, the operator should first read the current
+schedule, make any necessary changes, and then post the modified schedule. The
+current maintenance schedule can be obtained by sending a GET request to the
+master's `/maintenance/schedule` endpoint.
 
-To cancel a maintenance schedule, the operator should post an empty schedule.
+To cancel the maintenance schedule, the operator should post an empty schedule.
 
 ### Draining mode
 
@@ -163,27 +173,26 @@ As soon as a schedule is posted to the Mesos master, the following things occur:
   mode.  The mode of each machine is also persisted in the replicated log.
 * All frameworks using resources on affected agents are immediately
   notified.  Existing offers from the affected agents are rescinded
-  and re-sent with additional unavailability data.  All Frameworks using
+  and re-sent with additional unavailability data.  All frameworks using
   resources from the affected agents are given inverse offers.
 * New offers from the affected agents will also include
   the additional unavailability data.
 
 With this additional information, frameworks should perform scheduling in a
 maintenance-aware fashion.  Inverse offers communicate the frameworks' ability
-to conform to the maintenance schedule.
-For example:
+to conform to the maintenance schedule. For example:
 
 * A framework with long-running tasks may choose agents with no unavailability
   or with unavailability further in the future.
 * A datastore may choose to start a new replica if one of its agents is
-  scheduled for extensive maintenance or decommissioning.  If the datastore
+  scheduled for maintenance or decommissioning.  If the datastore
   can reasonably copy data into a new agent before maintenance,
   it would accept any inverse offers.  Otherwise, it would decline them.
-* A stateful task, on an eminently unavailable agent, may be migrated to
-  another available agent.  If the framework has sufficient resources to do
+* A stateful task on an agent with an impending unavailability may be migrated
+  to another available agent.  If the framework has sufficient resources to do
   so, it would accept any inverse offers.  Otherwise, it would decline them.
 
-Accepting an inverse offer indicates that the framework is ok with the
+Accepting an inverse offer indicates that the framework is okay with the
 maintenance schedule as it currently stands, given the current state of
 the framework's resources.  The master and operator should perceive acceptance
 as a best-effort promise by the framework to free all the resources contained
@@ -197,18 +206,18 @@ the viability of the maintenance schedule.  The filter for inverse offers is
 identical to the existing mechanism for re-offering offers to frameworks.
 
 **Note**: Accepting or declining an inverse offer does not result in
-immediate changes in the maintenance schedule, or in the way Mesos acts.
-Inverse offers only represent some extra information that frameworks may
-find useful. In the same manner, a rejection or acceptance of an offer is a
-hint for an operator. The operator may or may not chose to take that hint
+immediate changes in the maintenance schedule or in the way Mesos acts.
+Inverse offers only represent extra information that frameworks may
+find useful. In the same manner, rejecting or accepting an inverse offer is a
+hint for an operator. The operator may or may not choose to take that hint
 into account.
 
 ### Starting maintenance
 
-The operator starts maintenance by posting a list of machines to the master's
-maintenance endpoint.
+The operator starts maintenance by posting a list of machines to the
+[/machine/down](endpoints/master/machine/down.md) HTTP endpoint.
 
-See the definition of a [MachineID](https://github.com/apache/mesos/blob/016b02d7ed5a65bcad9261a133c8237c2df66e6e/include/mesos/v1/mesos.proto#L157-L167).
+The list of machines is specified in JSON format; each element of the list is a [MachineID](https://github.com/apache/mesos/blob/016b02d7ed5a65bcad9261a133c8237c2df66e6e/include/mesos/v1/mesos.proto#L157-L167).
 
 For example, to start maintenance on two machines:
 
@@ -220,7 +229,7 @@ For example, to start maintenance on two machines:
 ```
 
 ```
-curl http://localhost:5050/master/machine/down
+curl http://localhost:5050/machine/down
   -H "Content-type: application/json"
   -X POST
   -d @machines.json
@@ -236,31 +245,31 @@ The master checks that a list of machines has the following properties:
 * All listed machines must be present in the schedule.
 
 If any of these properties are not met, the list of machines is rejected
-with a corresponding error message and the master's state does not change.
+with a corresponding error message and the master's state is not changed.
 
 The operator can start maintenance on any machine that is scheduled for
 maintenance.  Machines that are not scheduled for maintenance cannot be
 directly transitioned from Up mode into Down mode.  However, the operator
-may schedule a machine for maintenance with a timestamp of the current
-time or in the past; and then immediately start maintenance on that machine.
+may schedule a machine for maintenance with a timestamp equal to the current
+time or in the past, and then immediately start maintenance on that machine.
 
-It is up to the operator to transition a machine from Draining to Deactivated
-mode.  Mesos will keep a machine in Draining mode even if the unavailability
-window arrives or passes.  This means that the operation of the machine is not
-disrupted in any way and offers (with unavailability information) are still
-sent for this machine.
+The operator must explicitly transition a machine from Draining to Deactivated
+mode. That is, Mesos will keep a machine in Draining mode even if the
+unavailability window arrives or passes.  This means that the operation of the
+machine is not disrupted in any way and offers (with unavailability information)
+are still sent for this machine.
 
 When maintenance is triggered by the operator, all agents on the machine are
-told to shutdown.  These agents are subsequently removed from the master
+told to shutdown.  These agents are subsequently removed from the master,
 which causes tasks to be updated as `TASK_LOST`.  Any agents from
 machines in maintenance are also prevented from registering with the master.
 
 ### Completing maintenance
 
-When maintenance is complete, or if maintenance needs to be cancelled,
+When maintenance is complete or if maintenance needs to be cancelled,
 the operator can stop maintenance.  The process is very similar
-to starting maintenance (same validation criterion as the previous section).
-The operator posts a list of machines to the master's endpoints:
+to starting maintenance (same validation criteria as the previous section).
+The operator posts a list of machines to the master's [/machine/up](endpoints/master/machine/up.md) endpoint:
 
 ```
 [
@@ -270,7 +279,7 @@ The operator posts a list of machines to the master's endpoints:
 ```
 
 ```
-curl http://localhost:5050/master/machine/up
+curl http://localhost:5050/machine/up
   -H "Content-type: application/json"
   -X POST
   -d @machines.json
@@ -279,11 +288,24 @@ curl http://localhost:5050/master/machine/up
 **Note**: The duration of the maintenance, as indicated by the "unavailability"
 field, is a best-effort guess made by the operator.  Stopping maintenance
 before the end of the unavailability interval is allowed, as is stopping
-maintenance after the end of the unavailability interval.  The machines are
+maintenance after the end of the unavailability interval.  Machines are
 never automatically transitioned out of maintenance.
 
-Frameworks are informed about the completion or cancellation of maintenance
-when offers from that machine start being sent.  There is no explicit mechanism
-for notifying frameworks when maintenance is stopped.  After maintenance is
-stopped, new offers are no longer tagged with unavailability and inverse offers
-are no longer sent.  Also, new agents can start to register from the machine.
+Frameworks are informed about the completion or cancellation of maintenance when
+offers from that machine start being sent.  There is no explicit mechanism for
+notifying frameworks when maintenance is stopped.  After maintenance is stopped,
+new offers are no longer tagged with unavailability and inverse offers are no
+longer sent.  Also, agents running on the machine will be allowed to register
+with the Mesos master.
+
+### Viewing maintenance status
+
+The current maintenance status (Up, Draining, or Down) of each machine in the
+cluster can be viewed by accessing the master's
+[/maintenance/status](endpoints/master/maintenance/status.md) HTTP endpoint. For
+each machine that is Draining, this endpoint also includes the frameworks' responses to
+inverse offers for resources on that machine. For more information, see the
+format of the [ClusterStatus message](https://github.com/apache/mesos/blob/fa36917dd142f66924c5f7ed689b87d5ceabbf79/include/mesos/maintenance/maintenance.proto#L73-L84).
+
+>NOTE: The format of the data returned by this endpoint may change in a
+future release of Mesos.
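
The three-machine example in docs/maintenance.md (two machines for one hour, then the last machine for the following hour) can be generated rather than written by hand, which avoids mistakes in the nanosecond timestamps. The following is a rough sketch and is not part of the commit. The JSON example is elided in this diff, so the field names (`windows`, `machine_ids`, `hostname`, `ip`, `unavailability`, `start`, `duration`, `nanoseconds`) are assumptions based on the linked `maintenance::Schedule` and `Unavailability` definitions and should be checked against them. The hostnames, IP addresses, start time, and helper names are hypothetical. The duplicate check mirrors the rule that a machine may appear in a schedule only once; it is a client-side convenience, not the master's own validation.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timedelta_to_nanos(delta):
    """Convert a timedelta to whole nanoseconds using integer arithmetic."""
    seconds = delta.days * 86400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000


def to_epoch_nanos(moment):
    """Convert a timezone-aware datetime to nanoseconds since the Unix epoch."""
    return timedelta_to_nanos(moment - EPOCH)


def make_window(machines, start, duration):
    """Build one maintenance window (field names are assumed, not confirmed)."""
    return {
        "machine_ids": [{"hostname": host, "ip": ip} for host, ip in machines],
        "unavailability": {
            "start": {"nanoseconds": to_epoch_nanos(start)},
            "duration": {"nanoseconds": timedelta_to_nanos(duration)},
        },
    }


def duplicate_machines(schedule):
    """Return machines listed more than once across all windows."""
    seen, duplicates = set(), set()
    for window in schedule["windows"]:
        for machine in window["machine_ids"]:
            key = (machine["hostname"], machine["ip"])
            if key in seen:
                duplicates.add(key)
            seen.add(key)
    return duplicates


# Hypothetical machines: two for one hour, then the last for the next hour.
first_start = datetime(2016, 4, 22, 12, 0, tzinfo=timezone.utc)
one_hour = timedelta(hours=1)
schedule = {
    "windows": [
        make_window([("machine1", "10.0.0.1"), ("machine2", "10.0.0.2")],
                    first_start, one_hour),
        make_window([("machine3", "10.0.0.3")],
                    first_start + one_hour, one_hour),
    ]
}

assert not duplicate_machines(schedule)
print(json.dumps(schedule, indent=2))
```

Writing the printed JSON to `schedule.json` would give a file suitable for the `-d @schedule.json` argument in the curl example, assuming the field names match the protobuf definitions.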

http://git-wip-us.apache.org/repos/asf/mesos/blob/f72d2e20/src/master/http.cpp
----------------------------------------------------------------------
diff --git a/src/master/http.cpp b/src/master/http.cpp
index a9cb99a..de06985 100644
--- a/src/master/http.cpp
+++ b/src/master/http.cpp
@@ -2054,7 +2054,7 @@ string Master::Http::MAINTENANCE_SCHEDULE_HELP()
         "GET: Returns the current maintenance schedule as JSON.",
         "",
         "POST: Validates the request body as JSON",
-        "  and updates the maintenance schedule."),
+        "and updates the maintenance schedule."),
     AUTHENTICATION(true));
 }