Posted to issues@mesos.apache.org by "Greg Mann (JIRA)" <ji...@apache.org> on 2018/02/02 00:36:10 UTC
[jira] [Commented] (MESOS-8488) Docker bug can cause unkillable tasks

    [ https://issues.apache.org/jira/browse/MESOS-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349586#comment-16349586 ] 

Greg Mann commented on MESOS-8488:
----------------------------------

Good point [~qianzhang] - adding a timeout would make sure that a user can kill the task, but it would not proactively notice that the container has already exited.

Since the Docker daemon itself does not know the container has exited, we would need to use some "out-of-band" method to discover this. Do you know of such a method?

If we can't come up with a good way to catch the container exit, then maybe we should settle for a timeout which allows stuck tasks to be killed. This is also a more general issue with Docker containers when the Docker daemon is hung on an agent: if the Docker daemon is not responsive, tasks launched with the Docker executor cannot be killed. If we add a timeout after which the executor commits suicide, then we could make sure that the executors for all such tasks will be killed. However, I'm afraid I don't know enough yet about Docker's recovery behavior to say whether we could rely on the container being cleaned up once the Docker daemon has successfully recovered.
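
The control flow of the timeout idea can be sketched roughly as follows. This is a minimal Python illustration, not Mesos code: `kill_with_timeout` is a hypothetical helper, and the stop command and timeout value are assumptions chosen for the example. It only shows "try to kill, give up after a deadline"; what happens to the container after the executor gives up is the open question above.

```python
import subprocess


def kill_with_timeout(stop_cmd, timeout_secs):
    """Run stop_cmd and report whether it finished before the deadline.

    Returns True if the command exited (with any status) within
    timeout_secs, and False if it was still running at the deadline,
    for example because the daemon it talks to is hung.
    """
    try:
        # subprocess.run kills the child process and raises
        # TimeoutExpired when the timeout elapses.
        subprocess.run(stop_cmd, timeout=timeout_secs, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False
```

A caller might run `kill_with_timeout(["docker", "stop", name], 30)` and have the executor exit when it returns False. The 30 second value and the `name` variable are placeholders, not values taken from Mesos.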

> Docker bug can cause unkillable tasks
> -------------------------------------
>
>                 Key: MESOS-8488
>                 URL: https://issues.apache.org/jira/browse/MESOS-8488
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Assignee: Qian Zhang
>            Priority: Major
>              Labels: mesosphere
>
> Due to an [issue on the Moby project|https://github.com/moby/moby/issues/33820], it's possible for Docker versions 1.13 and later to fail to catch a container exit, so that the {{docker run}} command which was used to launch the container will never return. This can lead to the Docker executor becoming stuck in a state where it believes the container is still running and cannot be killed.
> We should update the Docker executor to ensure that containers stuck in such a state cannot cause unkillable Docker executors/tasks.
> One way to do this would be a timeout, after which the Docker executor will commit suicide if a kill task attempt has not succeeded. However, if we do this, we should also ensure that, if the container was actually still running, either the Docker daemon or the DockerContainerizer cleans up the container when it does exit.
> Another option might be for the Docker executor to directly {{wait()}} on the container's Linux PID, in order to notice when the container exits.
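
As a rough illustration of the PID-based option: `wait()` and `waitpid()` only report on a process's own children, and the container process is normally not a child of the executor, so the sketch below polls with signal 0 instead. It is a Python sketch under assumptions, not Mesos code: `pid_exists` and `wait_for_exit` are hypothetical helpers, the PID is assumed to have been recorded while the container was known to be running, and PID reuse is ignored.

```python
import os
import time


def pid_exists(pid):
    """Return True if a process with this (positive) PID currently exists.

    Note: a process that has exited but has not been reaped by its
    parent (a zombie) still counts as existing.
    """
    try:
        os.kill(pid, 0)  # signal 0 performs the checks but sends nothing
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout_secs, poll_interval_secs=0.1):
    """Poll until the PID disappears; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_secs
    while pid_exists(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_secs)
    return True
```

Polling is the simplest portable approach on Linux. Because it does not depend on the Docker daemon at all, it would keep working while the daemon is hung, at the cost of the PID-reuse caveat noted above.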


