Posted to issues@mesos.apache.org by "Charles (Jira)" <ji...@apache.org> on 2020/03/31 08:08:00 UTC

[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

    [ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071571#comment-17071571 ] 

Charles commented on MESOS-10107:
---------------------------------

 [^reproduce-cgroup-rmdir-race.py]  [^mesos-remove-cgroup-race.diff] 

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
>                 Key: MESOS-10107
>                 URL: https://issues.apache.org/jira/browse/MESOS-10107
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Charles
>            Priority: Major
>         Attachments: mesos-remove-cgroup-race.diff, reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing intermittent errors on our cluster where the container cgroup isn't properly destroyed after the OOM killer kicks in once the memory limit has been exceeded - see the analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, but an strace confirmed that the cgroup removal was only attempted once all tasks had exited (edited and annotated strace below, from a different instance of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY <unfinished ...>
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.000036>
> 3431  23:01:32.294220 read(18,  <unfinished ...>
> 3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
> 15 <0.000033>
> 3431  23:01:32.294306 read(18, "", 4096) = 0 <0.000013>
> 3431  23:01:32.294346 close(18 <unfinished ...>
> 3431  23:01:32.294402 <... close resumed> ) = 0 <0.000045>
> # kill them
> 3431  23:01:32.296266 kill(5878, SIGKILL) = 0 <0.000019>
> 3431  23:01:32.296384 kill(6036, SIGKILL <unfinished ...>
> 3431  23:01:32.296474 kill(6210, SIGKILL) = 0 <0.000013>
> 6210  23:01:32.299159 +++ killed by SIGKILL +++
> 5878  23:01:32.327551 +++ killed by SIGKILL +++
> 3431  23:01:32.327601 --- SIGCHLD {si_signo=SIGCHLD,
> si_code=CLD_KILLED, si_pid=5878, si_uid=1000, si_status=SIGKILL,
> si_utime=18, si_stime=4} ---
> 6036  23:01:32.327923 +++ killed by SIGKILL +++
> # wait for them
> 3429  23:01:32.338912 wait4(5878,  <unfinished ...>
> 3429  23:01:32.339078 <... wait4 resumed> [{WIFSIGNALED(s) &&
> WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 5878 <0.000157>
> 3429  23:01:32.339944 wait4(6036, 0x7fcfb28ee5e0, WNOHANG, NULL) = -1
> ECHILD (No child processes) <0.000005>
> 3429  23:01:32.340938 wait4(6210,  <unfinished ...>
> 3429  23:01:32.341165 <... wait4 resumed> 0x7fcfb28ee5e0, WNOHANG,
> NULL) = -1 ECHILD (No child processes) <0.000218>
> # check cgroup empty
> 3426  23:01:32.341490 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY <unfinished ...>
> 3426  23:01:32.342474 read(47,  <unfinished ...>
> 3426  23:01:32.342842 <... read resumed> "", 8192) = 0 <0.000359>
> 3426  23:01:32.342940 close(47 <unfinished ...>
> 3426  23:01:32.344579 <... close resumed> ) = 0 <0.001631>
> # cgroup empty - attempt to delete
> 3428  23:01:32.345596
> rmdir("/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb"
> <unfinished ...>
> 3428  23:01:32.345679 <... rmdir resumed> ) = -1 EBUSY (Device or resource busy) <0.000068>
> {noformat}
> So, as can be seen, the cgroup destruction code correctly waits until the tasks have exited before removing the cgroup, and the cgroup is indeed empty, as shown by the read from cgroup.procs returning nothing right before the rmdir.
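> To make the sequence concrete, here is a minimal standalone sketch of the same ordering in Python (this is not Mesos code; the cgroup path is a placeholder, and it only illustrates the syscall ordering visible in the strace). Even after cgroup.procs reads back empty, the rmdir can still return EBUSY on affected kernels:
> {noformat}
> import errno
> import os
> import signal
> import time
>
> # Placeholder path - substitute a real freezer cgroup.
> CGROUP = "/sys/fs/cgroup/freezer/mesos/<container-id>"
>
> # 1. Get the list of processes in the cgroup.
> with open(os.path.join(CGROUP, "cgroup.procs")) as f:
>     pids = [int(line) for line in f if line.strip()]
>
> # 2. Kill them.
> for pid in pids:
>     try:
>         os.kill(pid, signal.SIGKILL)
>     except ProcessLookupError:
>         pass  # already exited
>
> # 3. Wait until the cgroup reads back as empty.
> while True:
>     with open(os.path.join(CGROUP, "cgroup.procs")) as f:
>         if not f.read().strip():
>             break
>     time.sleep(0.01)
>
> # 4. Remove it - on affected kernels this can still fail with EBUSY
> #    even though the cgroup is empty.
> try:
>     os.rmdir(CGROUP)
> except OSError as e:
>     if e.errno == errno.EBUSY:
>         print("rmdir failed with EBUSY despite an empty cgroup")
> {noformat}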
> It turns out that it's actually a kernel bug recently reported and fixed: https://lkml.org/lkml/2020/1/15/1349
> See the attached reproducer - it repeatedly starts a task that consumes several GB of memory, with a memory limit set slightly below that so the OOM killer is triggered.
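> For reference, a rough sketch of that kind of reproducer (the attached reproduce-cgroup-rmdir-race.py is the authoritative version; the cgroup names, limit and allocation sizes below are illustrative, and the sketch assumes a cgroup-v1 memory hierarchy and root privileges):
> {noformat}
> import errno
> import os
> import time
>
> MEM_ROOT = "/sys/fs/cgroup/memory"   # cgroup v1 memory controller
> LIMIT = 256 * 1024 * 1024            # illustrative limit, in bytes
>
> for attempt in range(100):
>     cg = os.path.join(MEM_ROOT, "repro-%d" % attempt)
>     os.mkdir(cg)
>     with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
>         f.write(str(LIMIT))
>     # (With swap enabled, memory.memsw.limit_in_bytes may also need to
>     # be set so the task is OOM-killed instead of swapping.)
>
>     pid = os.fork()
>     if pid == 0:
>         # Child: join the cgroup, then allocate past the limit so the
>         # OOM killer terminates us.
>         with open(os.path.join(cg, "cgroup.procs"), "w") as f:
>             f.write(str(os.getpid()))
>         data = bytearray()
>         while True:
>             data.extend(b"x" * (16 * 1024 * 1024))
>
>     os.waitpid(pid, 0)               # child is killed by the OOM killer
>
>     # The cgroup is now empty, yet rmdir can transiently fail with
>     # EBUSY on affected kernels.
>     try:
>         os.rmdir(cg)
>     except OSError as e:
>         if e.errno != errno.EBUSY:
>             raise
>         print("hit the race on attempt %d" % attempt)
>         time.sleep(1)
>         os.rmdir(cg)                 # a delayed retry typically succeeds
> {noformat}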
> It's a kernel bug, but since it affects virtually every kernel/Mesos combination currently in use, I think it would make sense to add a workaround - simply retrying the cgroup removal after a short timeout fixes it for me; see the attached patch.
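> Purely to illustrate the shape of that workaround (the attached mesos-remove-cgroup-race.diff is the actual change against the Mesos cgroups code; this standalone Python sketch just shows the retry-on-EBUSY idea):
> {noformat}
> import errno
> import os
> import time
>
> def remove_cgroup_with_retry(path, attempts=5, delay=0.1):
>     """Retry rmdir() on EBUSY to paper over the kernel race."""
>     for i in range(attempts):
>         try:
>             os.rmdir(path)
>             return
>         except OSError as e:
>             if e.errno != errno.EBUSY or i == attempts - 1:
>                 raise
>             # Give the kernel a moment to finish releasing the cgroup.
>             time.sleep(delay)
> {noformat}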


