Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2019/05/01 02:15:00 UTC

[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

    [ https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830821#comment-16830821 ] 

Joseph Wu commented on MESOS-9750:
----------------------------------

Preliminary fix and test here: https://reviews.apache.org/r/70577/

> Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9750
>                 URL: https://issues.apache.org/jira/browse/MESOS-9750
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.6.0, 1.7.0, 1.8.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. via SIGUSR1 or /master/machine/down).
> 2) The executor is sent a kill, and the agent starts counting down {{executor_shutdown_grace_period}}.
> 3) The executor exits before all terminal status updates reach the agent. This is more likely if {{executor_shutdown_grace_period}} elapses.
> This results in a completed executor whose tasks are still non-terminal (according to the status updates the agent received).
> When the agent starts back up, the completed executor is recovered and correctly shows up as a completed executor in {{/state}}. However, if you fetch the V1 {{GET_STATE}} result, there will be an entry in {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when constructing the response. The terminal task(s) whose latest status updates are non-terminal belong to completed executors, and are therefore reported as launched.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756
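
The pattern described above can be illustrated with a minimal, self-contained sketch. It is not the code in {{http.cpp}}: the types ({{Task}}, {{Executor}}, {{TaskState}}) and both functions are hypothetical, simplified stand-ins, and the "active only" variant is just one possible way to avoid the problem, not necessarily what the linked review does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the agent's bookkeeping.
// These are NOT the real Mesos types.
enum class TaskState { RUNNING, FINISHED, KILLED };

struct Task {
  std::string id;
  TaskState state;  // Latest state according to received status updates.
};

struct Executor {
  std::string id;
  std::vector<Task> tasks;
};

bool isTerminal(TaskState state) { return state != TaskState::RUNNING; }

// Mirrors the problematic pattern: active and completed executors are
// pooled, and each task is classified only by its latest status update.
std::vector<std::string> launchedTasksCombined(
    const std::vector<Executor>& active,
    const std::vector<Executor>& completed)
{
  std::vector<Executor> all(active);
  all.insert(all.end(), completed.begin(), completed.end());

  std::vector<std::string> launched;
  for (const Executor& executor : all) {
    for (const Task& task : executor.tasks) {
      if (!isTerminal(task.state)) {
        launched.push_back(task.id);
      }
    }
  }
  return launched;
}

// One possible alternative (an assumption, not the actual fix): only tasks
// of active executors can be reported as launched.
std::vector<std::string> launchedTasksActiveOnly(
    const std::vector<Executor>& active)
{
  return launchedTasksCombined(active, {});
}

// Reproduces the scenario from the description: the executor has completed,
// but the last status update the agent received was non-terminal.
void example()
{
  Executor executor{"default", {Task{"test-task", TaskState::RUNNING}}};

  // The pooled version still reports the task as launched.
  assert(launchedTasksCombined({}, {executor}).size() == 1);

  // Restricting to active executors reports nothing as launched.
  assert(launchedTasksActiveOnly({}).empty());
}
```

In this sketch a completed executor's task with a stale non-terminal update is only excluded from the launched list when completed executors are kept out of that classification.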



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)