Posted to user@mesos.apache.org by Whitney Sorenson <ws...@hubspot.com> on 2014/07/01 18:12:21 UTC

cgroups OOM handler causing lockups?

We've been running a few clusters on Amazon EC2 with Mesos 0.18.0 on the
new-generation C3 machines (generally c3.8xl) and have been experiencing
frequent system reboots.

Due to this issue (
http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSUehrc6_tV62DLw%40mail.gmail.com%3E)
we have been experimenting with some 3.10.25-1.el6.elrepo.x86_64 kernel
machines (the rest of the cluster is 2.6.32-431.el6.x86_64). Both sets of
machines seem equally likely to experience reboots, although the 3.10
machines do not come back unaided.

It seems that the kernel runs into problems in the OOM handler, and we see
traces such as:

[378328.089052] BUG: soft lockup - CPU#17 stuck for 22s! [java7:23300]
(https://gist.github.com/wsorenson/d2a12f1892b43aa28936)
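
(In case it helps anyone else chasing this: we pull these traces out of the
kernel log roughly like so; exact log paths may differ on your distro:)

dmesg | grep -i "soft lockup"                 # kernel ring buffer
grep -i "soft lockup" /var/log/messages       # persisted kernel messages on CentOS/RHEL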

Is this possibly related to https://issues.apache.org/jira/browse/MESOS-662?

Can anyone offer guidance on how to debug this further, or confirm whether
this is a known issue with certain Mesos versions? Some sleuthing indicates
that a patch for the above may have been added, removed
<https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=f90fe7641ea8f7066a6a1171a24ddaa8dc30e789>,
and added again
<https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=326aa493fb445302137d538d456712249504d251>
later.

Amazon has suggested using 3.10.45-1.el6.elrepo.x86_64, since they report it
includes some cgroup deadlock fixes. We are testing this out ASAP.

Thanks,

-Whitney

Re: cgroups OOM handler causing lockups?

Posted by Whitney Sorenson <ws...@hubspot.com>.
Following up on this issue.

After setting up a test cluster running many parallel VMs and repeatedly
triggering cgroup OOMs, we were able to isolate at least one current issue:
https://bugzilla.kernel.org/show_bug.cgi?id=80881
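
(Very roughly, the stress harness boils down to something like the sketch
below; oom_one.sh is a hypothetical stand-in for whatever script creates a
small memory cgroup and runs a memory hog inside it, and the real setup runs
this across many VMs rather than on a single box:)

# launch N workers, each repeatedly OOMing its own memory cgroup;
# oom_one.sh (hypothetical) creates the cgroup, sets a small
# memory.limit_in_bytes, and runs a memory hog inside it
N=32
for i in $(seq 1 "$N"); do
  ( while true; do ./oom_one.sh "oomtest-$i"; done ) &
done
wait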

We've also noticed a separate deadlock issue occurring in 3.10 which does
not appear to be present in 3.12+.

We did not have any issues with 3.4, and we are currently in the process of
moving over to it, since 3.4 also does not appear to trigger cgroup OOMs due
to cache usage, which was the problem in 2.6.x. In other words, it resolves
the issue described here:
http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3CCAJRB3TEj%2Bx4VRYicJM7aj7avcjr6QeXR8BmSUehrc6_tV62DLw%40mail.gmail.com%3E

-Whitney





Re: cgroups OOM handler causing lockups?

Posted by Whitney Sorenson <ws...@hubspot.com>.
Thanks for clearing up the confusion about those patches.

I can confirm:

cat /cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0

We can try to reproduce this outside of Mesos and see if we hit similar
issues. Thankfully, we are not using EBS.
-Whitney




Re: cgroups OOM handler causing lockups?

Posted by Ian Downes <ia...@gmail.com>.
Hi Whitney,

As Vinod said, 0.18.0 will ensure the kernel is set to handle OOM
conditions. The patches you linked are refactors that should not have
changed the behavior since 0.18.0. Could you please double-check that
/sys/fs/cgroup/memory/memory.oom_control has "oom_kill_disable 0"?

Can you attempt to reproduce this outside of Mesos by running a
process inside a manually created memory cgroup? Something like the dd
command in the thread you linked should trigger the OOM handler to run
and probably kill processes. Or, perhaps run your Java process with a
much lower memory limit.
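
For example, roughly along these lines (an untested sketch; it assumes the
cgroup v1 memory hierarchy is mounted at /sys/fs/cgroup/memory, which may be
/cgroup/memory on your CentOS 6 boxes, and the limit and sizes are arbitrary):

# create a small memory cgroup and set a hard limit
CG=/sys/fs/cgroup/memory/oomtest
mkdir -p "$CG"
echo $((64*1024*1024)) > "$CG/memory.limit_in_bytes"   # 64 MB

# put a shell into the cgroup, then exceed the limit; dd writes to tmpfs so
# the pages are charged to the cgroup and cannot be written back, which
# should force the cgroup OOM killer to kill dd
sh -c "echo \$\$ > $CG/tasks && exec dd if=/dev/zero of=/dev/shm/oomtest bs=1M count=256"
rm -f /dev/shm/oomtest

# check the kernel log for the OOM (and any lockups)
dmesg | tail -n 50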

From the trace you provided, I see mention of an ext4 write, so it looks
like the OOM handler is indeed trying to flush dirty pages to disk.
Are you running this on EBS? If so, there could be timing issues here
that the kernel isn't handling well; can you test using just the
ephemeral disk(s)?

Please do let us know how this goes!

Ian


Re: cgroups OOM handler causing lockups?

Posted by Vinod Kone <vi...@gmail.com>.
Hey Whitney,

I'll let Ian Downes comment on the specific patches you linked, but at a
high level the bug in MESOS-662 was due to Mesos trying to handle OOM
situations in user space instead of letting kernel handle it. We have since
then changed the behavior to let Kernel handle the OOM. You can confirm
this by checking "oom.control" file in the cgroup of your container (it
should say 'oom_kill_disable 0').
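
For example (illustrative only; the exact path depends on where the memory
cgroup hierarchy is mounted and on the slave's cgroups root, typically
"mesos"):

cat /sys/fs/cgroup/memory/mesos/<container-id>/memory.oom_control
# expected:
#   oom_kill_disable 0
#   under_oom 0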

