Posted to issues@mesos.apache.org by "Alexander Rukletsov (JIRA)" <ji...@apache.org> on 2017/09/01 13:20:00 UTC

[jira] [Commented] (MESOS-5352) Docker volume isolator cleanup can be blocked by first cleanup failure.

    [ https://issues.apache.org/jira/browse/MESOS-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150507#comment-16150507 ] 

Alexander Rukletsov commented on MESOS-5352:
--------------------------------------------

[~gilbert], what is the status here? Can we prioritize and fix it soon?

> Docker volume isolator cleanup can be blocked by first cleanup failure.
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5352
>                 URL: https://issues.apache.org/jira/browse/MESOS-5352
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Priority: Critical
>              Labels: containerizer
>
> The summary title may be confusing; please see the description below for details.
> Some background:
> 1). In the docker volume isolator cleanup, we currently do reference counting for docker volumes. The volume driver `unmount` is only called if the ref count is 1.
> 2). We keep a hash map `infos` to track the docker volume mount information for each containerId. A containerId is erased from the hash map only if all driver `unmount` calls succeed (i.e., each subprocess returns a ready future).
> The issue in this JIRA: suppose a slave keeps running (it is not shut down or rebooted), and we keep launching frameworks that use docker volumes. Once any docker volume isolator cleanup returns a failure, all later `unmount` calls for those volumes are blocked by the reference count: because `_cleanup()` returns a failure, the containerId is not erased from the hash map `infos`, even though all of its volumes may actually have been unmounted/detached correctly. (The docker volume isolator invokes the driver unmount as a subprocess, and the driver may return a failure message even when all volumes were unmounted/detached correctly.) The stale containerId in `infos` then adds one extra reference to those volumes in every subsequent isolator cleanup's reference counting, so those cleanups refuse to call driver unmount. As a result, even after all tasks finish, the docker volumes from the first failure are still in the `attached` status.
> This issue goes away once the slave recovers, but we cannot rely on restarting the slave every time we hit this case.

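Below is a minimal, self-contained C++ sketch of the behavior described in the quoted report. It is not the actual Mesos isolator code; the names (`referenceCount`, `driverUnmount`, `cleanup`) and the in-memory data structures are simplified stand-ins, kept only to illustrate how a single spurious cleanup failure can wedge the reference count for every later container that uses the same volume.

#include <iostream>
#include <map>
#include <set>
#include <string>

// Hypothetical aliases for illustration only.
using ContainerId = std::string;
using VolumeName = std::string;

// `infos`: containerId -> set of docker volumes that container mounted.
std::map<ContainerId, std::set<VolumeName>> infos;

// Reference count = number of tracked containers that mounted `volume`.
int referenceCount(const VolumeName& volume)
{
  int count = 0;
  for (const auto& entry : infos) {
    if (entry.second.count(volume) > 0) {
      ++count;
    }
  }
  return count;
}

// Stand-in for shelling out to the volume driver; `fail` simulates the
// driver returning an error even though the volume was in fact detached.
bool driverUnmount(const VolumeName& volume, bool fail = false)
{
  std::cout << (fail ? "unmount FAILED:    " : "unmount succeeded: ")
            << volume << std::endl;
  return !fail;
}

// Cleanup for one container: call driver unmount only when this container
// holds the last reference, and erase the container from `infos` only if
// every unmount reported success.
bool cleanup(const ContainerId& containerId, bool simulateDriverFailure = false)
{
  bool allSucceeded = true;

  for (const VolumeName& volume : infos[containerId]) {
    if (referenceCount(volume) == 1) {
      allSucceeded =
        driverUnmount(volume, simulateDriverFailure) && allSucceeded;
    } else {
      std::cout << "skip unmount (ref count > 1): " << volume << std::endl;
    }
  }

  if (allSucceeded) {
    infos.erase(containerId);  // Only erased on success.
  }

  return allSucceeded;
}

int main()
{
  // Container 1 mounts "vol" and its cleanup "fails" (driver error), so
  // container 1 is never erased from `infos`.
  infos["container-1"] = {"vol"};
  cleanup("container-1", /*simulateDriverFailure=*/true);

  // Every later container using "vol" now sees a ref count of 2 and
  // therefore never calls driver unmount: the volume stays attached.
  infos["container-2"] = {"vol"};
  cleanup("container-2");

  std::cout << "ref count for 'vol' after all cleanups: "
            << referenceCount("vol") << std::endl;

  return 0;
}

Running this sketch prints one skipped unmount and a final reference count of 1 for "vol": because a containerId is only erased from `infos` when every unmount reports success, the first spurious driver failure permanently inflates the count for later cleanups until the slave restarts and rebuilds its state.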


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)