Posted to issues@mesos.apache.org by "Andrei Budnik (JIRA)" <ji...@apache.org> on 2019/05/27 16:11:00 UTC
[jira] [Comment Edited] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup
[ https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849042#comment-16849042 ]
Andrei Budnik edited comment on MESOS-9306 at 5/27/19 4:10 PM:
---------------------------------------------------------------
The patch `/r/70609/` was discarded.
If `cgroups::destroy` hangs because of a blocking system call caused by a kernel bug, there is no workaround available on the Mesos side. In that case, we can only help an operator detect the problem. This can be achieved by introducing a debug endpoint for the Mesos containerizer; see MESOS-9756.
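To illustrate why detection is the only option here, below is a minimal sketch of a generic watchdog. It is not Mesos code and not the debug endpoint proposed in MESOS-9756; the names `run_with_watchdog`, `operation`, `timeout_seconds` and `on_stuck` are hypothetical. The sketch runs an operation in a background thread and reports when it outlives a timeout. It does not try to cancel the operation, since a thread blocked in a system call cannot be interrupted from the outside.

```python
import threading
import time


def run_with_watchdog(operation, timeout_seconds, on_stuck):
    """Run `operation` in a daemon thread and report if it outlives the timeout.

    The operation is never cancelled: if it is blocked in a system call,
    all this helper can do is tell the caller about it.
    Returns True if the operation finished in time, False otherwise.
    """
    done = threading.Event()

    def worker():
        try:
            operation()
        finally:
            # Signal completion even if the operation raised.
            done.set()

    thread = threading.Thread(target=worker, daemon=True)
    started = time.monotonic()
    thread.start()

    if done.wait(timeout_seconds):
        return True

    # Still running: report the elapsed time and leave the thread alone.
    on_stuck(time.monotonic() - started)
    return False
```

A caller could pass a logging function as `on_stuck`, which would give an operator a signal that a cleanup step has been pending for longer than expected.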
was (Author: abudnik):
The patch `/r/70609/` was discarded.
If `cgroups::destroy` hangs because of a blocking system call caused by a kernel bug, there is no workaround available on the Mesos side. In that case, we can only help an operator detect the problem. This could be done by introducing a debug endpoint for the Mesos containerizer; see MESOS-9756.
> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization
> Affects Versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Critical
> Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely destroyed after its associated tasks were killed. The following excerpt from the agent log is filtered to include only lines containing the container ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963] Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457] Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124] Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580] Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784 6808 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526 6810 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932 6801 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see
> {code}
> containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, which occurs because the agent's {{GET_CONTAINERS}} call is being polled once per minute. This seems to indicate that the container in question is still in the agent's {{containers_}} map.
> So, it seems that the containerizer is stuck either in the Linux launcher's {{destroy()}} code path, or the containerizer's {{destroy()}} code path.
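The manual log reading above can be approximated with a small script. The following is a minimal sketch, assuming agent log lines in the format shown in the excerpt. The function name `summarize_destroy`, the `min_warnings` threshold, and the "possibly stuck" heuristic are hypothetical and not part of Mesos. The heuristic flags a container whose last destroy-path line is a "Destroying cgroup" message followed only by repeated "Skipping status" warnings.

```python
import re

# Matches the glog-style part of a line such as:
#   I1010 14:20:50.204756 6799 containerizer.cpp:2963] Container ... has exited
LINE_RE = re.compile(
    r"(?P<severity>[IWEF])\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<source>[\w.]+):\d+\] (?P<message>.*)$"
)

# Source files that appear on the destroy path in the excerpt.
DESTROY_SOURCES = ("containerizer.cpp", "linux_launcher.cpp", "cgroups.cpp")


def summarize_destroy(lines, container_id, min_warnings=2):
    """Summarize destroy progress for one container from agent log lines."""
    last_destroy_step = None
    skipped_status = 0

    for line in lines:
        if container_id not in line:
            continue
        match = LINE_RE.search(line.rstrip("\n"))
        if match is None:
            continue
        message = match.group("message")
        if "Skipping status for container" in message:
            # Logged each time the container's status is polled.
            skipped_status += 1
        elif match.group("source") in DESTROY_SOURCES:
            last_destroy_step = (match.group("time"), message)

    # Hypothetical heuristic: the destroy path went quiet after a
    # "Destroying cgroup" line while status polling kept failing.
    possibly_stuck = (
        last_destroy_step is not None
        and last_destroy_step[1].startswith("Destroying cgroup")
        and skipped_status >= min_warnings
    )
    return {
        "last_destroy_step": last_destroy_step,
        "skipped_status_warnings": skipped_status,
        "possibly_stuck": possibly_stuck,
    }
```

Running it over the agent log with `summarize_destroy(open(path), "d463b9fe-970d-4077-bab9-558464889a9e")` would report the last destroy-path message and how many status warnings were logged for that container, which is the same evidence used in the analysis above.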