You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mesos.apache.org by vi...@apache.org on 2016/01/06 02:09:46 UTC

mesos git commit: Added initial draft of executor HTTP API user doc.

Repository: mesos
Updated Branches:
  refs/heads/master 31a5fb076 -> e4a68e182


Added initial draft of executor HTTP API user doc.

Review: https://reviews.apache.org/r/41454/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/e4a68e18
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/e4a68e18
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/e4a68e18

Branch: refs/heads/master
Commit: e4a68e1827837cb459afc3e87a3d9edc335ffe4e
Parents: 31a5fb0
Author: Anand Mazumdar <ma...@gmail.com>
Authored: Tue Jan 5 16:57:25 2016 -0800
Committer: Vinod Kone <vi...@gmail.com>
Committed: Tue Jan 5 17:09:35 2016 -0800

----------------------------------------------------------------------
 docs/executor-http-api.md | 358 +++++++++++++++++++++++++++++++++++++++++
 docs/home.md              |   1 +
 2 files changed, 359 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/e4a68e18/docs/executor-http-api.md
----------------------------------------------------------------------
diff --git a/docs/executor-http-api.md b/docs/executor-http-api.md
new file mode 100644
index 0000000..fd80005
--- /dev/null
+++ b/docs/executor-http-api.md
@@ -0,0 +1,358 @@
+---
+layout: documentation
+---
+
+# Executor HTTP API
+
+Mesos 0.27.0 added **experimental** support for V1 Executor HTTP API.
+
+
+## Overview
+
+The executor interacts with Mesos via  "/api/v1/executor" endpoint hosted by the Mesos agent. The fully qualified URL of the endpoint might look like:
+
+    http://agenthost:5051/api/v1/executor
+
+Note that we refer to this endpoint with its suffix "/executor" in the rest of this document. This endpoint accepts HTTP POST requests with data encoded as JSON (Content-Type: application/json) or binary Protobuf (Content-Type: application/x-protobuf). The first request that the executor sends to "/executor" endpoint is called SUBSCRIBE and results in a streaming response ("200 OK" status code with Transfer-Encoding: chunked). **Executors are expected to keep the subscription connection open as long as possible (barring errors in network, agent process restarts, software bugs etc.) and incrementally process the response** (NOTE: HTTP client libraries that can only parse the response after the connection is closed cannot be used). For the encoding used, please refer to **Events** section below.
+
+All the subsequent (non subscribe) requests to "/executor" endpoint (see details below in **Calls** section) must be sent using a different connection(s) than the one being used for subscription. Agent responds to these HTTP POST requests with "202 Accepted" status codes (or, for unsuccessful requests, with 4xx or 5xx status codes; details in later sections). The "202 Accepted" response means that a request has been accepted for processing, not that the processing of the request has been completed. The request might or might not be acted upon by Mesos (e.g., agent fails during the processing of the request). Any asynchronous responses from these requests will be streamed on the long-lived subscription connection.
+
+## Calls
+
+The following calls are currently accepted by the agent. The canonical source of this information is [executor.proto](https://github.com/apache/mesos/blob/master/include/mesos/v1/executor/executor.proto) (NOTE: The protobuf definitions are subject to change before the Beta API is finalized). Note that when sending JSON encoded Calls, executors should encode raw bytes in Base64 and strings in UTF-8.
+
+### SUBSCRIBE
+
+This is the first step in the communication process between the executor and agent. This is also to be considered as subscription to the "/executor" events stream.
+
+To subscribe with the agent, the executor sends a HTTP POST request with encoded `SUBSCRIBE` message. The HTTP response is a stream with [RecordIO](scheduler-http-api.md#recordio-response-format) encoding, with the first event being `SUBSCRIBED` event (see details in **Events** section).
+
+Additionally, if the executor is connecting to the agent after a [disconnection](#disconnections), it can also send a list of:
+
+* **Unacknowledged Status Updates**: The executor is expected to maintain a list of status updates not acknowledged by the agent via the `ACKNOWLEDGE` events.
+* **Unacknowledged Tasks**: The executor is expected to maintain a list of tasks that have not been acknowledged by the agent. A task is considered acknowledged if atleast one of the status updates for this task is acknowledged by the slave.
+
+```
+SUBSCRIBE Request (JSON):
+
+POST /api/v1/executor  HTTP/1.1
+
+Host: agenthost:5051
+Content-Type: application/json
+Accept: application/json
+
+{
+  "type": "SUBSCRIBE",
+  "executor_id": {
+    "value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
+  },
+  "framework_id": {
+    "value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
+  },
+  "subscribe": {
+    "unacknowledged_tasks": [
+      {
+        "name": "dummy-task",
+        "task_id": {
+          "value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
+        },
+        "agent_id": {
+          "value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
+        },
+        "command": {
+          "value": "ls",
+          "arguments": [
+            "-l",
+            "\/tmp"
+          ]
+        }
+      }
+    ],
+    "unacknowledged_updates": [
+      {
+        "framework_id": {
+          "value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
+        },
+        "status": {
+          "source": "SOURCE_EXECUTOR",
+          "task_id": {
+            "value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
+          },
+        "state": "TASK_RUNNING",
+        "uuid": "ZDQwZjNmM2UtYmJlMy00NGFmLWEyMzAtNGNiMWVhZTcyZjY3Cg=="
+        }
+      }
+    ]
+  }
+}
+
+SUBSCRIBE Response Event (JSON):
+HTTP/1.1 200 OK
+
+Content-Type: application/json
+Transfer-Encoding: chunked
+
+<event-length>
+{
+  "type": "SUBSCRIBED",
+  "subscribed": {
+    "executor_info": {
+      "executor_id": {
+        "value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
+      },
+      "command": {
+        "value": "\/path\/to\/executor"
+      },
+      "framework_id": {
+        "value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
+      }
+    },
+    "framework_info": {
+      "user": "foo",
+      "name": "my_framework"
+    },
+    "agent_id": {
+      "value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
+    },
+    "agent_info": {
+      "host": "agenthost",
+      "port": 5051
+    }
+  }
+}
+<more events>
+```
+
+NOTE: Once an executor is launched, the agent waits for a duration of `--executor_registration_timeout` (configurable at agent startup) for the executor to subscribe. If the executor fails to subscribe within this duration, the agent forcefully destroys the container executor is running in.
+
+### UPDATE
+
+Sent by the executor to reliably communicate the state of managed tasks. It is crucial that a terminal update (e.g., `TASK_FINISHED`, `TASK_KILLED` or `TASK_FAILED`) is sent to the agent as soon as the task terminates, in order to allow Mesos to release the resources allocated to the task.
+
+The scheduler must explicitly respond to this call through an `ACKNOWLEDGE` message (see `ACKNOWLEDGED` in the Events section below for the semantics). The executor must maintain a list of unacknowledged updates. If for some reason, the executor is disconnected from the agent, these updates must be sent as part of `SUBSCRIBE` request in the `unacknowledged_updates` field.
+
+```
+UPDATE Request (JSON):
+
+POST /api/v1/executor  HTTP/1.1
+
+Host: agenthost:5051
+Content-Type: application/json
+Accept: application/json
+
+{
+  "executor_id": {
+    "value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
+  },
+  "framework_id": {
+    "value": "9aaa9d0d-e00d-444f-bfbd-23dd197939a0-0000"
+  },
+  "type": "UPDATE",
+  "update": {
+    "status": {
+      "executor_id": {
+        "value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
+      },
+      "source": "SOURCE_EXECUTOR",
+      "state": "TASK_RUNNING",
+      "task_id": {
+        "value": "66724cec-2609-4fa0-8d93-c5fb2099d0f8"
+      },
+      "uuid": "ZDQwZjNmM2UtYmJlMy00NGFmLWEyMzAtNGNiMWVhZTcyZjY3Cg=="
+    }
+  }
+}
+
+UPDATE Response:
+HTTP/1.1 202 Accepted
+```
+
+### MESSAGE
+
+Sent by the executor to send arbitrary binary data to the scheduler. Note that Mesos neither interprets this data nor makes any guarantees about the delivery of this message to the scheduler. The `data` field is raw bytes encoded in Base64.
+
+```
+MESSAGE Request (JSON):
+
+POST /api/v1/executor  HTTP/1.1
+
+Host: agenthost:5051
+Content-Type: application/json
+Accept: application/json
+
+{
+  "executor_id": {
+    "value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
+  },
+  "framework_id": {
+    "value": "9aaa9d0d-e00d-444f-bfbd-23dd197939a0-0000"
+  },
+  "type": "MESSAGE",
+  "data": "t+Wonz5fRFKMzCnEptlv5A=="
+}
+
+MESSAGE Response:
+HTTP/1.1 202 Accepted
+```
+
+## Events
+
+Executor is expected to keep a **persistent** connection open to "/executor" endpoint even after getting a `SUBSCRIBED` HTTP Response event. This is indicated by "Connection: keep-alive" and "Transfer-Encoding: chunked" headers with *no* "Content-Length" header set. All subsequent events that are relevant to this executor generated by Mesos are streamed on this connection. Agent encodes each Event in [RecordIO](scheduler-http-api.md#recordio-response-format) format, i.e., string representation of length of the event in bytes followed by JSON or binary Protobuf  (possibly compressed) encoded event. Note that the value of length will never be ‘0’ and the size of the length will be the size of unsigned integer (i.e., 64 bits). Also, note that the `RecordIO` encoding should be decoded by the executor whereas the underlying HTTP chunked encoding is typically invisible at the application (executor) layer. The type of content encoding used for the events will be determined by the accep
 t header of the POST request (e.g., "Accept: application/json").
+
+The following events are currently sent by the agent. The canonical source of this information is at [executor.proto](include/mesos/v1/executor/executor.proto). Note that when sending JSON encoded events, agent encodes raw bytes in Base64 and strings in UTF-8.
+
+### SUBSCRIBED
+
+The first event sent by the agent when the executor sends a `SUBSCRIBE` request on the persistent connection. See `SUBSCRIBE` in Calls section for the format.
+
+### LAUNCH
+
+Sent by the agent whenever it needs to assign a new task to the executor. The executor is required to send an `UPDATE` message back to the agent indicating the success or failure of the task initialization.
+
+The executor must maintain a list of unacknowledged tasks (see `SUBSCRIBE` in `Calls` section). If for some reason, the executor is disconnected from the agent, these tasks must be sent as part of `SUBSCRIBE` request in the `tasks` field.
+
+```
+LAUNCH Event (JSON)
+<event-length>
+{
+  "type": "LAUNCH",
+  "launch": {
+    "framework_info": {
+      "id": {
+        "value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
+      },
+      "user": "foo",
+      "name": "my_framework"
+    },
+    "task": {
+      "name": "dummy-task",
+      "task_id": {
+        "value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
+      },
+      "agent_id": {
+        "value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
+      },
+      "command": {
+        "value": "sleep",
+        "arguments": [
+          "100"
+        ]
+      }
+    }
+  }
+}
+```
+
+### KILL
+
+The `KILL` event is sent whenever the scheduler needs to stop execution of a specific task. The executor is required to send a terminal update (e.g., `TASK_FINISHED`, `TASK_KILLED` or `TASK_FAILED`) back to the agent once it has stopped/killed the task. Mesos will mark the task resources as freed once the terminal update is received.
+
+```
+LAUNCH Event (JSON)
+<event-length>
+{
+  "type" : "KILL",
+  "kill" : {
+    "task_id" : {"value" : "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"}
+  }
+}
+```
+
+### ACKNOWLEDGED
+
+Sent by the agent in order to signal the executor that a status update was received as part of the reliable message passing mechanism. Acknowledged updates must not be retried.
+
+```
+ACKNOWLEDGED Event (JSON)
+<event-length>
+{
+  "type" : "ACKNOWLEDGED",
+  "acknowledged" : {
+    "task_id" : {"value" : "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"},
+    "uuid" : "ZDQwZjNmM2UtYmJlMy00NGFmLWEyMzAtNGNiMWVhZTcyZjY3Cg=="
+  }
+}
+```
+
+### MESSAGE
+
+Custom message generated by the scheduler and forwarded all the way to the executor. These messages are delivered "as-is" by Mesos and have no delivery guarantees. It is up to the scheduler to retry if a message is dropped for any reason. Note that `data` is raw bytes encoded as Base64.
+
+```
+MESSAGE Event (JSON)
+<event-length>
+{
+  "type" : "MESSAGE",
+  "message" : {
+    "data" : "c2FtcGxlIGRhdGE="
+  }
+}
+```
+
+### SHUTDOWN
+
+Sent by the agent in order to shutdown the executor. Once an executor gets a `SHUTDOWN` event it is required to kill all its tasks, send `TASK_KILLED` updates and gracefully exit. If an executor doesn't terminate within a certain period after the event was emitted (`grace_period_seconds`), the agent will forcefully destroy the container where the executor is running. The agent would then send `TASK_LOST` updates for any remaining active tasks of this executor.
+
+```
+SHUTDOWN Event (JSON)
+<event-length>
+{
+  "type" : "SHUTDOWN",
+  "shutdown" : {
+    "grace_period_seconds" : 5
+  }
+}
+```
+
+### ERROR
+
+Sent by the agent when an asynchronous error event is generated. It is recommended that the executor abort when it receives an error event and retry subscription.
+
+```
+ERROR Event (JSON)
+<event-length>
+{
+  "type" : "ERROR",
+  "error" : {
+    "message" : "Unrecoverable error"
+  }
+}
+```
+
+## Executor Environment Variables
+
+The following environment variables are set by the agent that can be used by the executor upon startup:
+
+* `MESOS_FRAMEWORK_ID`: `FrameworkID` of the scheduler needed as part of the `SUBSCRIBE` call.
+* `MESOS_EXECUTOR_ID`: `ExecutorID` of the executor needed as part of the `SUBSCRIBE` call.
+* `MESOS_DIRECTORY`: Path to the working directory for the executor.
+* `MESOS_AGENT_ENDPOINT`: Agent endpoint i.e. ip:port to be used by the executor to connect to the agent.
+* `MESOS_CHECKPOINT`: If set to true, denotes that framework has checkpointing enabled.
+
+If `MESOS_CHECKPOINT` is set i.e. when framework checkpointing is enabled, the following additional variables are also set that can be used by the executor for retrying upon a disconnection with the agent:
+
+* `MESOS_RECOVERY_TIMEOUT`: The total duration that the executor should spend retrying before shutting itself down when it is disconnected from the agent (e.g., `15mins`, `5secs` etc.). This is configurable at agent startup via the flag `--recovery_timeout`.
+* `MESOS_SUBSCRIPTION_BACKOFF_MAX`: The maximum backoff duration to be used by the executor between two retries when disconnected (e.g., `250ms`, `1mins` etc.). This is configurable at agent startup via the flag `--executor_reregistration_timeout`.
+
+NOTE: Additionally, the executor also inherits all the agent's environment variables.
+
+## Disconnections
+
+An executor considers itself disconnected if the persistent subscription connection (opened via SUBSCRIBE request) to "/executor" breaks. The disconnection can happen due to an agent process failure etc.
+
+Upon detecting a disconnection from the agent, the retry behavior depends on whether framework checkpointing is enabled:
+
+* If framework checkpointing is disabled, the executor is not supposed to retry subscription and gracefully exit.
+* If framework checkpointing is enabled, the executor is supposed to retry subscription using a suitable [backoff strategy](#backoff-strategies) for a duration of `MESOS_RECOVERY_TIMEOUT`. If it is not able to establish a subscription with the agent within this duration, it should gracefully exit.
+
+## Agent Recovery
+
+Upon agent startup, an agent performs [recovery](slave-recovery.md). This allows the agent to recover status updates and reconnect with old executors. Currently, the agent supports the following recovery mechanisms specified via the `--recover` flag:
+
+* **reconnect** (default): This mode allows the agent to reconnect with any of it’s old live executors provided the framework has enabled checkpointing. The recovery of the agent is only marked complete once all the disconnected executors have connected and hung executors have been destroyed. Hence, it is mandatory that every executor retries at least once within the interval (`MESOS_SUBSCRIPTION_BACKOFF_MAX`) to ensure it is not shutdown by the agent due to being hung/unresponsive.
+* **cleanup** : This mode kills any old live executors and then exits the agent. This is usually done by operators when making a non-compatible slave/executor upgrade. Upon receiving a `SUBSCRIBE` request from the executor of a framework with checkpointing enabled, the agent would send it a `SHUTDOWN` event as soon as it reconnects. For hung executors, the agent would wait for a duration of `--executor_shutdown_grace_period` (configurable at agent startup) and then forcefully kill the container where the executor is running in.
+
+
+## Backoff Strategies
+
+Executors are encouraged to retry subscription using a suitable backoff strategy like linear backoff, when they notice a disconnection with the agent. A disconnection typically happens when the agent process terminates (e.g., restarted for an upgrade). Each retry interval should be bounded by the value of `MESOS_SUBSCRIPTION_BACKOFF_MAX` which is set as an environment variable.

http://git-wip-us.apache.org/repos/asf/mesos/blob/e4a68e18/docs/home.md
----------------------------------------------------------------------
diff --git a/docs/home.md b/docs/home.md
index 6f0f4b9..bc22d31 100644
--- a/docs/home.md
+++ b/docs/home.md
@@ -51,6 +51,7 @@ layout: documentation
 * [Framework Development Guide](/documentation/latest/app-framework-development-guide/) describes how to build applications on top of Mesos.
 * [Reconciliation](/documentation/latest/reconciliation/) for ensuring a framework's state remains eventually consistent in the face of failures.
 * [Scheduler HTTP API](/documentation/latest/scheduler-http-api/) describes the new HTTP API for communication between schedulers and the Mesos master.
+* [Executor HTTP API](/documentation/latest/executor-http-api/) describes the new HTTP API for communication between executors and the Mesos agent.
 * [Javadoc](/api/latest/java/) documents the Mesos Java API.
 * [Doxygen](/api/latest/c++/namespacemesos.html) documents the Mesos C++ API.
 * [Developer Tools](/documentation/latest/tools/) for hacking on Mesos or writing frameworks.