Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2019/04/19 03:14:00 UTC

[jira] [Comment Edited] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup

    [ https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821636#comment-16821636 ] 

Qian Zhang edited comment on MESOS-9306 at 4/19/19 3:13 AM:
------------------------------------------------------------

It seems the isolator's `cleanup` was not called for this container. In the Mesos agent log, I see the following CNI-related messages for another container, `2a31c63b-370c-4e59-b0f4-bb8320017c3f`:
{code:java}
2018-10-10 14:14:24: I1010 14:14:24.274730 6805 cni.cpp:960] Bind mounted '/proc/26188/ns/net' to '/run/mesos/isolators/network/cni/2a31c63b-370c-4e59-b0f4-bb8320017c3f/ns' for container 2a31c63b-370c-4e59-b0f4-bb8320017c3f
2018-10-10 14:14:24: I1010 14:14:24.505472 6804 cni.cpp:1394] Got assigned IPv4 address '9.0.1.5/25' from CNI network 'dcos' for container 2a31c63b-370c-4e59-b0f4-bb8320017c3f
2018-10-10 14:14:24: I1010 14:14:24.505879 6810 cni.cpp:1100] Unable to find DNS nameservers for container 2a31c63b-370c-4e59-b0f4-bb8320017c3f, using host '/etc/resolv.conf'
2018-10-10 14:20:50: I1010 14:20:50.317416 6808 cni.cpp:1670] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/2a31c63b-370c-4e59-b0f4-bb8320017c3f/ns' for container 2a31c63b-370c-4e59-b0f4-bb8320017c3f
2018-10-10 14:20:50: I1010 14:20:50.317649 6808 cni.cpp:1682] Removed the container directory '/run/mesos/isolators/network/cni/2a31c63b-370c-4e59-b0f4-bb8320017c3f'
{code}
The first three lines are from the CNI isolator's `isolate` method, and the last two are from the CNI isolator's `cleanup` method; this is the expected behavior.

But for this container `d463b9fe-970d-4077-bab9-558464889a9e`, I only see:
{code:java}
2018-10-10 14:17:16: I1010 14:17:16.536072 6804 cni.cpp:960] Bind mounted '/proc/30145/ns/net' to '/run/mesos/isolators/network/cni/d463b9fe-970d-4077-bab9-558464889a9e/ns' for container d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:17:16: I1010 14:17:16.743510 6802 cni.cpp:1394] Got assigned IPv4 address '9.0.1.7/25' from CNI network 'dcos' for container d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:17:16: I1010 14:17:16.743850 6802 cni.cpp:1100] Unable to find DNS nameservers for container d463b9fe-970d-4077-bab9-558464889a9e, using host '/etc/resolv.conf'
{code}
So it seems the CNI isolator's `cleanup` method was not called. The CNI isolator should be the first isolator cleaned up in the containerizer's destroy code path once the Linux launcher's destroy is done, so it appears `MesosContainerizerProcess::cleanupIsolators` was somehow never invoked for this container.
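To illustrate the sequencing I mean, here is a minimal, self-contained sketch using plain `std::future` (an analogy only, not the actual Mesos/libprocess code; the names are hypothetical stand-ins): the isolator cleanup step is chained after the launcher's destroy, so if the cgroup destroy never completes, the CNI isolator's `cleanup` (and its log lines) never happen.
{code}
// A minimal, runnable analogy (plain C++11 std::future, NOT the actual
// Mesos/libprocess code): the isolator cleanup only runs once the launcher's
// destroy future becomes ready, so a cgroup destroy that never completes
// keeps the CNI isolator's `cleanup` from ever being called.
#include <chrono>
#include <future>
#include <iostream>

int main()
{
  // Stands in for the Linux launcher's destroy (freezer/systemd cgroup removal).
  std::promise<void> launcherDestroy;
  std::future<void> destroyed = launcherDestroy.get_future();

  // Stands in for `MesosContainerizerProcess::cleanupIsolators`, which is
  // chained after the launcher's destroy. Nothing ever calls
  // `launcherDestroy.set_value()` here, mirroring a stuck cgroup destroy.
  if (destroyed.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
    std::cout << "Launcher destroy still pending; isolator cleanup never runs,"
                 " so no CNI 'Unmounted the network namespace handle' log line"
              << std::endl;
  } else {
    std::cout << "Launcher destroy done; isolator cleanup would run now"
              << std::endl;
  }

  return 0;
}
{code}
That matches what the agent log shows for `d463b9fe-970d-4077-bab9-558464889a9e`: the freezer cgroup was frozen and thawed, and destruction of the systemd cgroup was started, but nothing after that, so the chained isolator cleanup (and its CNI log lines) never appeared.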


> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
>                 Key: MESOS-9306
>                 URL: https://issues.apache.org/jira/browse/MESOS-9306
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.7.0
>            Reporter: Greg Mann
>            Priority: Critical
>              Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely destroyed after its associated tasks were killed. The following is an excerpt from the agent log which is filtered to include only lines with the container ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see
> {code}
> containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, which occurs because the agent's {{GET_CONTAINERS}} call is being polled once per minute. This seems to indicate that the container in question is still in the agent's {{containers_}} map.
> So it seems that the containerizer is stuck either in the Linux launcher's {{destroy()}} code path or in the containerizer's {{destroy()}} code path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)