Posted to commits@mesos.apache.org by al...@apache.org on 2017/09/07 11:29:40 UTC
mesos git commit: Add documentation for possible task reasons.
Repository: mesos
Updated Branches:
refs/heads/master dee5f5ebd -> d6c0448dd
Add documentation for possible task reasons.
Review: https://reviews.apache.org/r/61495/
Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/d6c0448d
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/d6c0448d
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/d6c0448d
Branch: refs/heads/master
Commit: d6c0448dd1b4623d73d82159c71dfb9f13dcc999
Parents: dee5f5e
Author: Benno Evers <be...@yandex-team.ru>
Authored: Thu Sep 7 13:25:48 2017 +0200
Committer: Alexander Rukletsov <al...@apache.org>
Committed: Thu Sep 7 13:28:57 2017 +0200
----------------------------------------------------------------------
docs/home.md | 1 +
docs/task-state-reasons.md | 470 ++++++++++++++++++++++++++++++++++++++
include/mesos/mesos.proto | 1 +
include/mesos/v1/mesos.proto | 1 +
4 files changed, 473 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/mesos/blob/d6c0448d/docs/home.md
----------------------------------------------------------------------
diff --git a/docs/home.md b/docs/home.md
index ad91f2f..dcb6235 100644
--- a/docs/home.md
+++ b/docs/home.md
@@ -68,6 +68,7 @@ layout: documentation
* [Javadoc](/api/latest/java/) documents the old Java API.
* [Operator HTTP API](operator-http-api.md) describes the new HTTP API for communication between operators and Mesos master/agent.
* [Scheduler HTTP API](scheduler-http-api.md) describes the new HTTP API for communication between schedulers and the Mesos master.
+* [Task State Reasons](task-state-reasons.md) describes how task state reasons are used in Mesos.
* [Versioning](versioning.md) describes HTTP API and release versioning.
http://git-wip-us.apache.org/repos/asf/mesos/blob/d6c0448d/docs/task-state-reasons.md
----------------------------------------------------------------------
diff --git a/docs/task-state-reasons.md b/docs/task-state-reasons.md
new file mode 100644
index 0000000..07f7927
--- /dev/null
+++ b/docs/task-state-reasons.md
@@ -0,0 +1,470 @@
+---
+title: Apache Mesos - Task State Reasons
+layout: documentation
+---
+
+# Task State Reasons
+
+Some TaskStatus messages will arrive with the `reason` field set to a value
+that can allow frameworks to display better error messages and to implement
+special behaviour for some of the reasons.
+
+For most reasons, the `message` field of the TaskStatus message will give a
+more detailed, human-readable error description.
+
+Not all status updates will contain a reason.
+
+
+# Guidelines for Framework Authors
+
+Frameworks that implement their own executors are free to set the reason field
+on any status messages they produce.
+
+Note that executors cannot generally rely on the scheduler seeing the status
+update with the reason set by the executor, since only the latest update for
+each different task state is stored and re-transmitted. See in particular
+the description of `REASON_RECONCILIATION` below.
+
+Most reasons describe conditions that can only be detected in the master or
+agent code, and will accompany automatically generated status updates from
+either of these.
+
+For consistency with the existing usages of the different task reasons, we
+recommend that executors restrict themselves to the following subset if they
+use a non-default reason in their status updates.
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_TASK_CHECK_STATUS_UPDATED
+</code></td><td> For executors that support running task checks, it is
+ recommended to generate a status update with this reason
+ every time the task check status changes, together with a
+ human-readable description of the change in
+ the <code> message </code> field.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_HEALTH_CHECK_STATUS_UPDATED
+</code></td><td> For executors that support running task health checks, it
+ is recommended to generate a status update with this reason
+ every time the health check status changes, together with a
+ human-readable description of the change in
+ the <code> message </code> field.
+<br/><strong> Note:
+</strong> The built-in executors additionally send an update with
+ this reason every time a health check is unhealthy.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_INVALID
+</code></td><td> For executors that implement their own task validation
+ logic, this reason can be used when the validation check
+ fails, together with a human-readable description of the
+ failed check in the <code> message </code> field.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_UNAUTHORIZED
+</code></td><td> For executors that implement their own authorization logic,
+ this reason can be used when authorization fails, together
+ with a human-readable description in
+ the <code> message </code> field.
+</td></tr></table>
+
+
+
+# Reference of Reasons Currently Used in Mesos
+
+## Deprecated Reasons
+
+The reason `REASON_COMMAND_EXECUTOR_FAILED` is deprecated and will be removed
+in the future. It should not be referenced by newly written code.
+
+
+## Unused Reasons
+
+The reasons `REASON_CONTAINER_LIMITATION`, `REASON_INVALID_FRAMEWORKID`,
+`REASON_SLAVE_UNKNOWN`, `REASON_TASK_UNKNOWN` and
+`REASON_EXECUTOR_UNREGISTERED` are not used as of Mesos 1.4.
+
+
+## Reasons for Terminal Status Updates
+
+For these status updates, the reason indicates *why* the task state changed.
+Typically, a given reason will always appear together
+with the same state.
+
+These updates are typically generated by Mesos when an error occurs that
+prevents the executor from sending its own status update messages.
+
+Below, a partition-aware framework means a framework which has the
+`Capability::PARTITION_AWARE` capability bit set in its `FrameworkInfo`.
+Messages generated on the master will have the `source` field set to
+`SOURCE_MASTER` and messages generated on the agent will have it set
+to `SOURCE_AGENT` in the v1 API or `SOURCE_SLAVE` in the v0 API.
+
+As of Mesos 1.4, the following reasons are being used.
+
+
+### For state `TASK_FAILED`
+#### In status updates generated on the agent:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_CONTAINER_LAUNCH_FAILED
+</code></td><td> The task could not be launched because its container failed
+ to launch.
+</td></tr>
+
+
+<tr><td><code> REASON_CONTAINER_LIMITATION_MEMORY
+</code></td><td> The container in which the task was running exceeded its
+ memory allocation.
+</td></tr>
+
+
+<tr><td><code> REASON_CONTAINER_LIMITATION_DISK
+</code></td><td> The container in which the task was running exceeded its
+ disk quota.
+</td></tr>
+
+
+<tr><td><code> REASON_IO_SWITCHBOARD_EXITED
+</code></td><td> The I/O switchboard server terminated unexpectedly.
+</td></tr>
+
+
+<tr><td><code> REASON_EXECUTOR_REGISTRATION_TIMEOUT
+</code></td><td> The executor for this task didn't register with the agent
+ within the allowed time limit.
+</td></tr>
+
+
+<tr><td><code> REASON_EXECUTOR_REREGISTRATION_TIMEOUT
+</code></td><td> The executor for this task lost connection and didn't
+ reregister within the allowed time limit.
+</td></tr>
+
+
+<tr><td><code> REASON_EXECUTOR_TERMINATED
+</code></td><td> The task's executor terminated abnormally, and no more
+ specific reason could be determined.
+</td></tr></table>
+
+
+
+### For state `TASK_KILLED`
+#### In status updates generated on the master:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_FRAMEWORK_REMOVED
+</code></td><td> The framework to which this task belonged was removed.
+<br/><strong> Note:
+</strong> The status update will be sent out before the task is
+ actually killed.
+</td></tr>
+<tr><td><code> REASON_TASK_KILLED_DURING_LAUNCH
+</code></td><td> This task, or a task within this task group, was killed
+ before delivery to the agent.
+</td></tr></table>
+
+
+#### In status updates generated on the agent:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_TASK_KILLED_DURING_LAUNCH
+</code></td><td> This task, or a task within this task group, was killed
+ before delivery to the executor.
+<br/><strong> Note:
+</strong> Prior to version 1.5, the agent would in this situation
+ sometimes send status updates with reason set
+ to <code> REASON_EXECUTOR_UNREGISTERED </code> and
+ sometimes without any reason set, depending on details of
+ the timing of the executor launch and the kill command.
+</td></tr></table>
+
+
+
+### For state `TASK_ERROR`
+#### In status updates generated on the master:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_TASK_INVALID
+</code></td><td> Task or resource validation checks failed.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_GROUP_INVALID
+</code></td><td> Task group or resource validation checks failed.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_UNAUTHORIZED
+</code></td><td> Task authorization failed on the master.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_GROUP_UNAUTHORIZED
+</code></td><td> Task group authorization failed on the master.
+</td></tr></table>
+
+
+#### In status updates generated on the agent:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_TASK_UNAUTHORIZED
+</code></td><td> Task authorization failed on the agent.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_GROUP_UNAUTHORIZED
+</code></td><td> Task group authorization failed on the agent.
+</td></tr></table>
+
+
+
+### For state `TASK_LOST`
+#### In status updates generated on the master:
+
+<table class="table table-striped">
+
+<tr><td rowspan="2">
+<code> REASON_SLAVE_DISCONNECTED
+</code></td><td> The agent on which the task was running disconnected, and
+ didn't reconnect in time.
+</td></tr>
+<tr><td> The task was part of an accepted offer, but the agent
+ sending the offer disconnected in the meantime.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_MASTER_DISCONNECTED
+</code></td><td> The task was part of an accepted offer which couldn't be
+ sent to the master, because it was disconnected.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+<br/><strong> Note:
+</strong> Despite the source being set to <code> SOURCE_MASTER </code>,
+ the message is not sent from the master but locally from
+ the scheduler driver.
+<br/><strong> Note:
+</strong> This reason is only used in the v0 API.
+</td></tr>
+
+
+<tr><td rowspan="3">
+<code> REASON_SLAVE_REMOVED
+</code></td><td> The agent on which the task was running was removed.
+</td></tr>
+<tr><td> The task was part of an accepted offer, but the agent
+ sending the offer was disconnected in the meantime.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr>
+<tr><td> The agent on which the task was running was marked
+ unreachable.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_UNREACHABLE </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_RESOURCES_UNKNOWN
+</code></td><td> The task was part of an accepted offer which used
+ checkpointed resources that are not known to the master.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr></table>
+
+
+#### In status updates generated on the agent:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_SLAVE_RESTARTED
+</code></td><td> The task was launched during an agent restart, and never
+ got forwarded to the executor.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_CONTAINER_PREEMPTED
+</code></td><td> The container in which the task was running was pre-empted
+ by a QoS correction.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_GONE </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_CONTAINER_UPDATE_FAILED
+</code></td><td> The container in which the task was running was discarded
+ because a resource update failed.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_GONE </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_EXECUTOR_TERMINATED
+</code></td><td> The executor which was supposed to execute this task had
+ already terminated, or the agent received an instruction to
+ kill the task before the executor was started.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_GC_ERROR
+</code></td><td> A directory to be used by this task was scheduled for GC
+ and it could not be unscheduled.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr>
+
+
+<tr><td><code> REASON_INVALID_OFFERS
+</code></td><td> This task belonged to an accepted offer that didn't pass
+ validation checks.
+<br/><strong> Note:
+</strong> For partition-aware frameworks, the state will
+ be <code> TASK_DROPPED </code> instead.
+</td></tr></table>
+
+
+
+### For state `TASK_DROPPED`
+#### In status updates generated on the master:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_SLAVE_DISCONNECTED
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr>
+
+
+<tr><td><code> REASON_SLAVE_REMOVED
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr>
+
+
+<tr><td><code> REASON_RESOURCES_UNKNOWN
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr></table>
+
+
+#### In status updates generated on the agent:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_SLAVE_RESTARTED
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr>
+
+
+<tr><td><code> REASON_GC_ERROR
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr>
+
+
+<tr><td><code> REASON_INVALID_OFFERS
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr></table>
+
+
+
+### For state `TASK_UNREACHABLE`
+#### In status updates generated on the master:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_SLAVE_REMOVED
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr></table>
+
+
+
+### For state `TASK_GONE`
+#### In status updates generated on the agent:
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_CONTAINER_UPDATE_FAILED
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr>
+
+<tr><td><code> REASON_CONTAINER_PREEMPTED
+</code></td><td> See <code> TASK_LOST </code>
+</td></tr>
+
+<tr><td><code> REASON_EXECUTOR_PREEMPTED
+</code></td><td> Renamed to <code> REASON_CONTAINER_PREEMPTED </code> in
+ Mesos 0.26.
+</td></tr></table>
+
+
+
+## Reasons for Non-Terminal Status Updates
+
+These reasons do not cause a state change, and will be sent along with the
+last known state of the task. The reason field indicates *why* the status
+update was sent.
+
+
+<table class="table table-striped">
+
+<tr><td><code> REASON_RECONCILIATION
+</code></td><td> A framework requested implicit or explicit reconciliation
+ for this task.
+<br/><strong> Note:
+</strong> Status updates with this reason are not the original ones,
+ but rather a modified copy that is re-sent from the master.
+ In particular, the original <code> data </code>
+ and <code> message </code> fields are erased and the
+ original <code> reason </code> field is overwritten
+ by <code> REASON_RECONCILIATION </code>.
+</td></tr>
+
+<tr><td><code> REASON_TASK_CHECK_STATUS_UPDATED
+</code></td><td> A task check notified the agent that its state changed.
+<br/><strong> Note:
+</strong> This reason is set by the executor, so for tasks that are
+ running with a custom executor, whether or not status
+ updates with this reason are sent depends on that
+ executor's implementation.
+<br/><strong> Note:
+</strong> Currently, when using one of the built-in executors, this
+ reason is only used within status updates with task
+ state <code> TASK_RUNNING </code>.
+</td></tr>
+
+
+<tr><td><code> REASON_TASK_HEALTH_CHECK_STATUS_UPDATED
+</code></td><td> A task health check notified the agent that its
+ state changed.
+<br/><strong> Note:
+</strong> This reason is set by the executor, so for tasks that are
+ running with a custom executor, whether or not status
+ updates with this reason are sent depends on that
+ executor's implementation.
+<br/><strong> Note:
+</strong> Currently, when using one of the built-in executors, this
+ reason is only used within status updates with task
+ state <code> TASK_RUNNING </code>.
+</td></tr>
+
+</table>
http://git-wip-us.apache.org/repos/asf/mesos/blob/d6c0448d/include/mesos/mesos.proto
----------------------------------------------------------------------
diff --git a/include/mesos/mesos.proto b/include/mesos/mesos.proto
index 3b9a6fd..1c74cbb 100644
--- a/include/mesos/mesos.proto
+++ b/include/mesos/mesos.proto
@@ -2171,6 +2171,7 @@ message TaskStatus {
}
// Detailed reason for the task status update.
+ // Refer to docs/task-state-reasons.md for additional explanation.
//
// TODO(bmahler): Differentiate between slave removal reasons
// (e.g. unhealthy vs. unregistered for maintenance).
http://git-wip-us.apache.org/repos/asf/mesos/blob/d6c0448d/include/mesos/v1/mesos.proto
----------------------------------------------------------------------
diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
index 85de9e3..9aef05f 100644
--- a/include/mesos/v1/mesos.proto
+++ b/include/mesos/v1/mesos.proto
@@ -2154,6 +2154,7 @@ message TaskStatus {
}
// Detailed reason for the task status update.
+ // Refer to docs/task-state-reasons.md for additional explanation.
//
// TODO(bmahler): Differentiate between agent removal reasons
// (e.g. unhealthy vs. unregistered for maintenance).
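
As an illustration for readers of this commit, here is a rough sketch of how a scheduler might consume the reasons documented in docs/task-state-reasons.md. It is not part of the commit and does not call any Mesos API: status updates are modelled as plain Python dicts with string values, and the helper names `describe` and `partition_aware_state` are hypothetical. The lookup table covers only reasons for which the TASK_LOST tables give a single state for partition-aware frameworks. Reasons with several cases, such as REASON_SLAVE_REMOVED, are deliberately left out.

```python
from typing import Optional

# Hypothetical lookup, transcribed from the TASK_LOST tables in the new
# document: the state a partition-aware framework receives instead of
# TASK_LOST. Plain strings stand in for the protobuf enum values.
PARTITION_AWARE_STATE = {
    "REASON_RESOURCES_UNKNOWN": "TASK_DROPPED",
    "REASON_SLAVE_RESTARTED": "TASK_DROPPED",
    "REASON_GC_ERROR": "TASK_DROPPED",
    "REASON_INVALID_OFFERS": "TASK_DROPPED",
    "REASON_CONTAINER_PREEMPTED": "TASK_GONE",
    "REASON_CONTAINER_UPDATE_FAILED": "TASK_GONE",
}


def partition_aware_state(reason: str) -> Optional[str]:
    """Return the documented partition-aware state, or None if the
    reason is not in the table above (for example because the document
    lists more than one case for it)."""
    return PARTITION_AWARE_STATE.get(reason)


def describe(status: dict) -> str:
    """Build a one-line description of a status update.

    `status` is assumed to be a dict with a "state" key and optional
    "reason" and "message" keys; this shape is an assumption made for
    the sketch, not the real TaskStatus message.
    """
    state = status["state"]
    reason = status.get("reason")  # not all status updates carry a reason
    if reason is None:
        return state
    if reason == "REASON_RECONCILIATION":
        # The document notes that the original message, data and reason
        # are not preserved in reconciliation updates.
        return f"{state} (reconciliation)"
    text = f"{state} [{reason}]"
    message = status.get("message", "")
    if message:
        text += f": {message}"
    return text
```

A framework could use something like `describe` for log lines and `partition_aware_state` to sanity-check the state it receives, falling back to the raw state whenever the lookup returns `None`.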