Posted to issues@mesos.apache.org by "Ian Downes (JIRA)" <ji...@apache.org> on 2014/08/08 23:49:12 UTC

[jira] [Updated] (MESOS-1689) Race with kernel to kill process / destroy cgroup after OOM

     [ https://issues.apache.org/jira/browse/MESOS-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Downes updated MESOS-1689:
------------------------------

    Description: 
The recently refactored cgroups::destroy code can fail to freeze a freezer cgroup under a particular ordering of events, as detailed below. When this happens the LinuxLauncher fails to destroy the container's cgroup and the other isolators are then unable to destroy their cgroups.
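
At the cgroup filesystem level, the freezer-based destroy amounts roughly to the sketch below. This is only an illustration (the <container-id> path is a placeholder and the real Mesos code does this asynchronously, with retries and a timeout); it shows why a cgroup stuck in FREEZING blocks the whole sequence at the first step.
{noformat}
# Illustrative sketch only; <container-id> is a placeholder.
CG=/sys/fs/cgroup/freezer/mesos/<container-id>

echo FROZEN > ${CG}/freezer.state                     # 1. ask the freezer to freeze all tasks
while [ "$(cat ${CG}/freezer.state)" != "FROZEN" ]; do sleep 0.1; done
xargs -a ${CG}/cgroup.procs -r kill -KILL             # 2. SIGKILL every (frozen) process
echo THAWED > ${CG}/freezer.state                     # 3. thaw so the signals get delivered
while [ -s ${CG}/cgroup.procs ]; do sleep 0.1; done   # 4. wait for the cgroup to empty
rmdir ${CG}                                           # 5. remove the now-empty cgroup
{noformat}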

This failure is logged but otherwise ignored by a running slave. If the slave is subsequently restarted, it will block on the cgroups::destroy during launcher recovery, time out after 60 seconds, and fail recovery, which in turn causes the slave to terminate. If the slave is monitored and automatically restarted it will flap repeatedly.

The problem manifests as a container freezer cgroup that never transitions to FROZEN and remains stuck in FREEZING, because one or more of its processes cannot be frozen.

{noformat}
[idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
FREEZING
[idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs  | xargs ps -L
  PID   LWP TTY      STAT   TIME COMMAND
29369 29369 ?        Dsl    0:02 python2.6 ./thermos_executor
29369 29482 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29483 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29484 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29485 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29486 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29487 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29488 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29489 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29490 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29491 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29492 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29493 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29494 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29495 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29496 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29497 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29498 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29499 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29500 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29582 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29583 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29584 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29585 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29604 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29605 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29606 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29607 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29608 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29610 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29612 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29613 ?        Dsl    0:00 python2.6 ./thermos_executor
29526 29526 ?        D      0:02 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
29578 29578 ?        Ds    29:49 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
29603 29603 ?        R    254:08 python2.6 /usr/local/bin/package_cache 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r

{noformat}
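
One way to see which processes are holding up the freeze is to look at each one's state and kernel wait channel. A diagnostic sketch (using the container id from this incident; requires a procps ps that supports the wchan output column):
{noformat}
# Show state and wait channel for every process still in the stuck cgroup.
CG=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a
cat ${CG}/cgroup.procs | xargs ps -o pid,stat,wchan:30,comm
# Anything not parked in __refrigerator is what keeps the cgroup in FREEZING.
{noformat}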

Debugging with [~jieyu] indicates the following sequence of events:
1. The cgroup reaches its memory limit.

2. The kernel notifies Mesos that an OOM condition has occurred.

3. Mesos initiates freezing the cgroup.

4. There is now a race in the kernel between freezing the cgroup's processes and deciding which process to kill. The kernel does check that the selected process is suitable for killing and will thaw it if necessary, but there is a window between this check and actually signaling the process. The selected process can therefore be frozen before the SIGKILL is delivered; it does not die and so does not release its memory.
{noformat}
[idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
Aug  8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
Aug  8 01:52:05 s_all@hostname kernel: [804284.985019] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child

[idownes@hostname ~]$ cat /proc/29578/status | grep State
State:  D (disk sleep)

[idownes@hostname ~]$ cat /proc/29578/status | grep SigPnd
SigPnd: 0000000000000100

[idownes@hostname ~]$ sudo cat /proc/29578/stack
[sudo] password for idownes:
[<ffffffff8107b17d>] __refrigerator+0xc9/0x125
[<ffffffff8104b1fe>] try_to_freeze+0x30/0x32
[<ffffffff8104d8e5>] get_signal_to_deliver+0x50/0x4ba
[<ffffffff81002217>] do_signal+0x49/0x5ec
[<ffffffff810027e3>] do_notify_resume+0x29/0x65
[<ffffffff814abd12>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
{noformat}
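
For reference, the pending-signal mask above already shows the kernel's kill sitting undelivered: SigPnd is a bitmask with bit (n-1) set for each pending signal n, so 0x100 = 1 << 8 corresponds to signal 9, SIGKILL. It is queued but cannot be delivered while the task is parked in __refrigerator. A quick check (with $PID standing in for the frozen victim, 29578 above):
{noformat}
grep SigPnd /proc/$PID/status    # -> SigPnd: 0000000000000100  (bit 8 set)
kill -l 9                        # -> KILL
{noformat}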

5. Mesos continues to wait for the cgroup to freeze, polling on freezer.state.

6. Meanwhile, another process in the cgroup tries to allocate a page. This triggers a page fault that cannot be satisfied because the cgroup is at its limit (the selected process has not been killed, so no memory has been freed). The process keeps faulting, bouncing between runnable/running and uninterruptible sleep, and never gets frozen.
{noformat}
[idownes@hostname ~]$ sudo cat /proc/29603/stack
[<ffffffff81065246>] __cond_resched+0x26/0x30
[<ffffffff81125abd>] vmpressure+0x3b/0x7c
[<ffffffff81125b18>] vmpressure_prio+0x1a/0x1c
[<ffffffff810f02ce>] do_try_to_free_pages+0x71/0x31c
[<ffffffff810f0748>] try_to_free_mem_cgroup_pages+0xb0/0x10f
[<ffffffff81122d3d>] mem_cgroup_reclaim+0x4d/0xb2
[<ffffffff811231ba>] __mem_cgroup_try_charge+0x2a9/0x8ff
[<ffffffff81123c52>] mem_cgroup_charge_common+0x35/0x5d
[<ffffffff81124dba>] mem_cgroup_newpage_charge+0x3a/0x3c
[<ffffffff810fe48d>] handle_pte_fault+0x66f/0x749
[<ffffffff810ff3e4>] handle_mm_fault+0xf0/0x123
[<ffffffff814a8102>] __do_page_fault+0x1b5/0x35f
[<ffffffff814a82ba>] do_page_fault+0xe/0x10
[<ffffffff814a53f2>] page_fault+0x22/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
{noformat}
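
A quick way to observe this (with $PID standing in for the faulting process, 29603 above) is to sample its state for a few seconds; it bounces between R and D and never reaches the freezer:
{noformat}
# Sample the process state ~5 times a second and tally the results.
for i in $(seq 20); do grep '^State:' /proc/$PID/status; sleep 0.2; done | sort | uniq -c
{noformat}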

7. Freezing the cgroup cannot progress. After 60 seconds the slave gives up and the destroy is discarded.

The cgroup can be unstuck by any of the following (see the sketch after this list):
a) killing the faulting process, which lets the cgroup finish freezing;
b) thawing the cgroup (writing THAWED to freezer.state), which delivers the SIGKILL to the selected process and frees memory, so the faulting process can continue and the next freeze attempt succeeds;
c) slightly increasing the memory cgroup's limit so the stuck page fault can allocate a page.
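
For example, workarounds b) and c) amount to the following (the container id is a placeholder and the new memory limit is purely illustrative; both require root):
{noformat}
# b) Thaw the freezer cgroup so the pending SIGKILL is finally delivered, then
#    let the slave (or a manual destroy) retry the freeze.
echo THAWED > /sys/fs/cgroup/freezer/mesos/<container-id>/freezer.state

# c) Bump the memory limit slightly so the stuck page fault can complete.
echo 2200M > /sys/fs/cgroup/memory/mesos/<container-id>/memory.limit_in_bytes
{noformat}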



> Race with kernel to kill process / destroy cgroup after OOM
> -----------------------------------------------------------
>
>                 Key: MESOS-1689
>                 URL: https://issues.apache.org/jira/browse/MESOS-1689
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Ian Downes
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)