Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2019/05/01 02:15:00 UTC
[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may
report a complete executor's tasks as non-terminal after a graceful agent
shutdown
[ https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830821#comment-16830821 ]
Joseph Wu commented on MESOS-9750:
----------------------------------
Preliminary fix and test here: https://reviews.apache.org/r/70577/
> Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown
> ------------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-9750
> URL: https://issues.apache.org/jira/browse/MESOS-9750
> Project: Mesos
> Issue Type: Bug
> Components: agent, executor
> Affects Versions: 1.6.0, 1.7.0, 1.8.0
> Reporter: Joseph Wu
> Assignee: Joseph Wu
> Priority: Major
> Labels: foundations
>
> The problem occurs after the following sequence:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on {{executor_shutdown_grace_period}}.
> 3) The executor exits before all terminal status updates reach the agent. This is more likely if {{executor_shutdown_grace_period}} elapses first.
> This results in a completed executor whose tasks are still non-terminal according to the status updates the agent received.
> When the agent starts back up, the completed executor is recovered and shows up correctly as a completed executor in {{/state}}. However, if you fetch the V1 {{GET_STATE}} result, there is still an entry in {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when constructing the response. The terminated task(s) that only have non-terminal status updates are stored under completed executors, so they end up in {{launched_tasks}}.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756
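To make the description above concrete, here is a minimal, self-contained sketch of the classification problem. It is not the Mesos code at the link above and not necessarily what the review does: the types (Task, Executor, TasksResponse) and both functions are hypothetical, simplified stand-ins. It assumes tasks are sorted into "launched" or "completed" purely by their latest status update, and shows one possible correction that also consults whether the owning executor has already completed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the agent's bookkeeping.
// These are NOT the real Mesos types.
enum class TaskState { RUNNING, FINISHED, KILLED, FAILED };

struct Task {
  std::string id;
  TaskState latestState;  // State carried by the last status update received.
};

struct Executor {
  std::string id;
  bool completed;  // True once the executor has exited.
  std::vector<Task> tasks;
};

struct TasksResponse {
  std::vector<Task> launched;
  std::vector<Task> completed;
};

bool isTerminal(TaskState state) {
  return state != TaskState::RUNNING;
}

// Assumed buggy behavior: running and completed executors are walked together
// and each task is classified only by its latest status update.
TasksResponse classifyByStatusOnly(const std::vector<Executor>& all) {
  TasksResponse response;
  for (const Executor& executor : all) {
    for (const Task& task : executor.tasks) {
      if (isTerminal(task.latestState)) {
        response.completed.push_back(task);
      } else {
        response.launched.push_back(task);
      }
    }
  }
  return response;
}

// One possible correction (an assumption, not the actual patch): a task that
// belongs to a completed executor is never reported as launched.
TasksResponse classifyWithExecutorState(const std::vector<Executor>& all) {
  TasksResponse response;
  for (const Executor& executor : all) {
    for (const Task& task : executor.tasks) {
      if (executor.completed || isTerminal(task.latestState)) {
        response.completed.push_back(task);
      } else {
        response.launched.push_back(task);
      }
    }
  }
  return response;
}

// Reproduces the reported scenario: a completed executor whose only task
// never delivered a terminal status update.
void checkScenario() {
  const std::vector<Executor> all = {
      Executor{"default", true, {Task{"test-task", TaskState::RUNNING}}}};

  assert(classifyByStatusOnly(all).launched.size() == 1);
  assert(classifyWithExecutorState(all).launched.empty());
  assert(classifyWithExecutorState(all).completed.size() == 1);
}
```

Calling checkScenario() from a test or from main() exercises both variants. The sketch deliberately keeps the stale TASK_RUNNING state on the task itself and only changes which list the task is reported in.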