Posted to issues@mesos.apache.org by "Benno Evers (JIRA)" <ji...@apache.org> on 2018/01/04 13:57:00 UTC

[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

    [ https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311352#comment-16311352 ] 

Benno Evers commented on MESOS-8391:
------------------------------------

I was able to confirm this behaviour with Mesos 1.5 on a DC/OS 1.11 cluster.

For case (2), while the system state eventually returns to normal and Marathon correctly re-schedules the two tasks, the original task seems to stay in the `TASK_KILLING` state indefinitely.

From a quick look at the logs, the agent gets as far as "Checkpointing termination state to nested container's runtime directory", but never attempts to destroy the parent container afterwards. I'm currently looking at the container destruction code path to see what the expected behaviour would be.
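
A minimal sketch of how one might spot tasks in this state, assuming task records shaped like the ones returned by the Mesos master's `/tasks` endpoint (an `id`, a `state`, and a list of `statuses` with `state` and `timestamp`). The helper name, the threshold, and the sample data are hypothetical and only for illustration:

```python
def tasks_stuck_in_killing(tasks, now, threshold_secs=300.0):
    """Return IDs of tasks that have been in TASK_KILLING longer than threshold_secs.

    `tasks` is assumed to be a list of dicts with "id", "state" and "statuses"
    keys; `now` is a Unix timestamp in seconds.
    """
    stuck = []
    for task in tasks:
        if task.get("state") != "TASK_KILLING":
            continue
        # Use the earliest TASK_KILLING status update as the start of the kill.
        killing_ts = [
            s["timestamp"]
            for s in task.get("statuses", [])
            if s.get("state") == "TASK_KILLING"
        ]
        if killing_ts and now - min(killing_ts) > threshold_secs:
            stuck.append(task["id"])
    return stuck


# Hypothetical sample data, not taken from a real cluster.
sample_tasks = [
    {"id": "pod.sleep1", "state": "TASK_FAILED",
     "statuses": [{"state": "TASK_FAILED", "timestamp": 1000.0}]},
    {"id": "pod.sleep2", "state": "TASK_KILLING",
     "statuses": [{"state": "TASK_RUNNING", "timestamp": 900.0},
                  {"state": "TASK_KILLING", "timestamp": 1000.0}]},
]

if __name__ == "__main__":
    print(tasks_stuck_in_killing(sample_tasks, now=2000.0))
```

The 300 second default is arbitrary; any value comfortably above the expected kill grace period would do.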

> Mesos agent doesn't notice that a pod task exits or crashes after the agent restart
> -----------------------------------------------------------------------------------
>
>                 Key: MESOS-8391
>                 URL: https://issues.apache.org/jira/browse/MESOS-8391
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, executor
>    Affects Versions: 1.5.0
>            Reporter: Ivan Chernetsky
>            Priority: Critical
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 10000}}.
> # Restart the Mesos agent on the node where the pod was launched.
> # Kill one of the pod tasks.
> *Expected result*: The Mesos agent detects that one of the tasks got killed, and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks that both tasks are running just fine. Marathon doesn't take any action because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards the correct status update, but the other task stays in {{TASK_KILLING}} state forever
> # Perform steps in (1).
> # Restart the Mesos agent.
> *Expected result*: The Mesos agent detects that one of the tasks crashed, forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, forwards the corresponding status update, but the other task stays in {{TASK_KILLING}} state forever.
> Please note that after another agent restart, the other task finally gets killed and the correct status updates get propagated all the way to Marathon.
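
For reproducing step (1), a pod definition along these lines could be used. This is only a sketch: the pod ID, container names, and resource values are made up, and the field layout follows the Marathon pod JSON format as I understand it, so it should be checked against the Marathon version in use:

```python
import json


def make_sleep_pod(pod_id="/sleep-pod", seconds=10000):
    """Build a Marathon pod definition with two containers that only sleep.

    The ID, container names and resource sizes are hypothetical.
    """
    def container(name):
        return {
            "name": name,
            "exec": {"command": {"shell": f"sleep {seconds}"}},
            "resources": {"cpus": 0.1, "mem": 32},
        }

    return {
        "id": pod_id,
        "containers": [container("sleep1"), container("sleep2")],
        "networks": [{"mode": "host"}],
        "scaling": {"kind": "fixed", "instances": 1},
    }


if __name__ == "__main__":
    # Print the JSON body that would be submitted to Marathon.
    print(json.dumps(make_sleep_pod(), indent=2))
```

The printed JSON would then be submitted to Marathon as a new pod before restarting the agent and killing one of the two tasks.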


