Posted to user@mesos.apache.org by Thomas Petr <tp...@hubspot.com> on 2014/06/17 23:23:34 UTC

cgroups memory isolation

Hello,

We're running Mesos 0.18.0 with cgroups isolation, and have run into
situations where lots of file I/O causes tasks to be killed due to
exceeding memory limits. Here's an example:
https://gist.github.com/tpetr/ce5d80a0de9f713765f0

We were under the impression that if cache was using a lot of memory it
would be reclaimed *before* the OOM process decides to kill the task. Is
this accurate? We also found MESOS-762
<https://issues.apache.org/jira/browse/MESOS-762> while trying to diagnose
-- could this be a regression?

Thanks,
Tom

Re: cgroups memory isolation

Posted by Benjamin Mahler <be...@gmail.com>.
+Ian Downes, who is knowledgeable about the OOMing behavior in the kernel.

From your logs, it doesn't look related to MESOS-762; that was a bug in
0.14.0.


On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <tp...@hubspot.com> wrote:

> Hello,
>
> We're running Mesos 0.18.0 with cgroups isolation, and have run into
> situations where lots of file I/O causes tasks to be killed due to
> exceeding memory limits. Here's an example:
> https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>
> We were under the impression that if cache was using a lot of memory it
> would be reclaimed *before* the OOM process decides to kill the task. Is
> this accurate? We also found MESOS-762
> <https://issues.apache.org/jira/browse/MESOS-762> while trying to
> diagnose -- could this be a regression?
>
> Thanks,
> Tom
>

Re: cgroups memory isolation

Posted by Thomas Petr <tp...@hubspot.com>.
Eric pointed out that I had a typo in the instance type -- it's a
c3.8xlarge (containing SSDs, which could make a difference here).


On Wed, Jun 18, 2014 at 10:36 AM, Thomas Petr <tp...@hubspot.com> wrote:

> Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> kernel.
>
> I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> some weird results. I initially gave the task 256 MB, and it never exceeded
> the memory allocation (I killed the task manually after 5 minutes when the
> file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
> tried again. It exceeded memory
> <https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82> almost
> immediately. The next (replacement) task our framework started ran
> successfully and never exceeded memory. I watched nr_dirty and it
> fluctuated between 10000 to 14000 when the task is running. The slave host
> is a c3.xlarge in EC2, if it makes a difference.
>
> As Mesos users, we'd like an isolation strategy that isn't affected by
> cache this much -- it makes it harder for us to appropriately size things.
> Is it possible through Mesos or cgroups itself to make the page cache not
> count towards the total memory consumption? If the answer is no, do you
> think it'd be worth looking at using Docker for isolation instead?
>
> -Tom
>
>
> On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <ia...@gmail.com> wrote:
>
>> Hello Thomas,
>>
>> Your impression is mostly correct: the kernel will *try* to reclaim
>> memory by writing out dirty pages before killing processes in a cgroup
>> but if it's unable to reclaim sufficient pages within some interval (I
>> don't recall this off-hand) then it will start killing things.
>>
>> We observed this on a 3.4 kernel where we could overwhelm the disk
>> subsystem and trigger an oom. Just how quickly this happens depends on
>> how fast you're writing compared to how fast your disk subsystem can
>> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
>> contained in a memory cgroup will fill the cache quickly, reach its
>> limit and get oom'ed. We were not able to reproduce this under 3.10
>> and 3.11 kernels. Which kernel are you using?
>>
>> Example: under 3.4:
>>
>> [idownes@hostname tmp]$ cat /proc/self/cgroup
>> 6:perf_event:/
>> 4:memory:/test
>> 3:freezer:/
>> 2:cpuacct:/
>> 1:cpu:/
>> [idownes@hostname tmp]$ cat
>> /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
>> 134217728
>> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
>> Killed
>> [idownes@hostname tmp]$ ls -lah lotsazeros
>> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>>
>>
>> You can also look in /proc/vmstat at nr_dirty to see how many dirty
>> pages there are (system wide). If you wrote at a rate sustainable by
>> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
>> (use something like watch) as the cgroup approached its limit and the
>> kernel flushed dirty pages to bring it down.
>>
>> This might be an interesting read:
>>
>> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>>
>> Hope this helps! Please do let us know if you're seeing this on a
>> kernel >= 3.10, otherwise it's likely this is a kernel issue rather
>> than something with Mesos.
>>
>> Thanks,
>> Ian
>>
>>
>> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <tp...@hubspot.com> wrote:
>> > Hello,
>> >
>> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
>> > situations where lots of file I/O causes tasks to be killed due to
>> exceeding
>> > memory limits. Here's an example:
>> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>> >
>> > We were under the impression that if cache was using a lot of memory it
>> > would be reclaimed *before* the OOM process decides to kill the task.
>> Is
>> > this accurate? We also found MESOS-762 while trying to diagnose -- could
>> > this be a regression?
>> >
>> > Thanks,
>> > Tom
>>
>
>

Re: cgroups memory isolation

Posted by Tim St Clair <ts...@redhat.com>.
https://issues.apache.org/jira/browse/MESOS-1516 

----- Original Message -----

> From: "Vinod Kone" <vi...@gmail.com>
> To: user@mesos.apache.org
> Cc: "Ian Downes" <ia...@gmail.com>, "Eric Abbott" <ea...@hubspot.com>
> Sent: Thursday, June 19, 2014 2:35:20 PM
> Subject: Re: cgroups memory isolation

> On Thu, Jun 19, 2014 at 11:33 AM, Sharma Podila < spodila@netflix.com >
> wrote:

> > Yeah, having soft-limit for memory seems like the right thing to do
> > immediately. The only problem left to solve being that it would be nicer to
> > throttle I/O instead of OOM for high rate I/O jobs. Hopefully the soft
> > limits on memory push this problem to only the extreme edge cases.
> 

> The reason that Mesos uses hard limits for memory and cpu is to provide
> predictability for the users/tasks. For example, some users/tasks don't want
> to be in a place where the task has been improperly sized but was humming
> along fine because it was using idle resources on the machine (soft limits)
> but during crunch time (e.g., peak workload) cannot work as well because the
> machine had multiple tasks all utilizing their full allocations. In other
> words, this provides the users the ability to better predict their SLAs.

> That said, in some cases the tight SLAs probably don't make sense (e.g.,
> batch jobs). That is the reason we let operators configure soft and hard
> limits for cpu. Unless I misunderstand how memory soft limits work (
> https://www.kernel.org/doc/Documentation/cgroups/memory.txt ) I don't see
> why we can't provide a similar soft limit option for memory.

> IOW, feel free to file a ticket :)

-- 
Cheers, 
Tim 
Freedom, Features, Friends, First -> Fedora 
https://fedoraproject.org/wiki/SIGs/bigdata 

Re: cgroups memory isolation

Posted by Vinod Kone <vi...@gmail.com>.
On Thu, Jun 19, 2014 at 11:33 AM, Sharma Podila <sp...@netflix.com> wrote:

> Yeah, having soft-limit for memory seems like the right thing to do
> immediately. The only problem left to solve being that it would be nicer to
> throttle I/O instead of OOM for high rate I/O jobs. Hopefully the soft
> limits on memory push this problem to only the extreme edge cases.
>

The reason that Mesos uses hard limits for memory and cpu is to provide
predictability for the users/tasks. For example, some users/tasks don't
want to be in a place where an improperly sized task hums along fine
because it is using idle resources on the machine (soft limits), but
during crunch time (e.g., peak workload) cannot work as well because the
machine has multiple tasks all utilizing their full allocations. In other
words, this gives users the ability to better predict their SLAs.

That said, in some cases the tight SLAs probably don't make sense (e.g.,
batch jobs). That is the reason we let operators configure soft and hard
limits for cpu. Unless I misunderstand how memory soft limits work (
https://www.kernel.org/doc/Documentation/cgroups/memory.txt) I don't see
why we can't provide a similar soft limit option for memory.
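
Concretely, going by that memory.txt doc, the difference would look roughly
like this (a sketch only -- assuming a cgroup v1 memory controller mounted
at /sys/fs/cgroup/memory and a cgroup named "test" as in Ian's example
earlier in the thread; writes need root):

# Hard limit (what the cgroups isolator sets today): exceeding it triggers
# reclaim and, failing that, the per-cgroup OOM killer.
echo 268435456 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes        # 256 MB

# Soft limit: just a target the kernel reclaims back towards when the
# machine as a whole is under memory pressure; it does not OOM the task
# by itself.
echo 268435456 > /sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes   # 256 MB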

IOW, feel free to file a ticket :)

Re: cgroups memory isolation

Posted by Sharma Podila <sp...@netflix.com>.
Yeah, having a soft limit for memory seems like the right thing to do
immediately. The only problem left to solve is that it would be nicer to
throttle I/O instead of OOMing for high-rate I/O jobs. Hopefully the soft
limits on memory push this problem to only the extreme edge cases.
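
(On the throttling front, cgroups does expose a blkio throttle knob,
roughly like the sketch below -- "8:0" being the target device's
major:minor numbers and the "test" group path borrowed from Ian's example,
so treat the exact paths as illustrative. My understanding is that on the
kernels discussed here it only governs direct/synchronous I/O, not the
dirty-page writeback that's causing the OOMs in this thread, so it
wouldn't be a complete answer on its own.)

# Cap writes to device 8:0 at ~10 MB/s for tasks in this blkio cgroup:
echo "8:0 10485760" > /sys/fs/cgroup/blkio/test/blkio.throttle.write_bps_device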

Agreed on still enforcing limits in general. This tends to be an ongoing
issue from the operations perspective; I've had my share of dealing with
it, and I am sure I will continue to do so. Sometimes users can't estimate,
sometimes jobs' memory footprint changes drastically with minor changes,
etc. Memory usage prediction based on historic usage and reactive resizing
based on actual usage are two tools of the trade.

BTW, by resize, did you mean cgroups memory limits can be resized for
running jobs? That's nice to know (I am relatively new to cgroups).
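
(If so, I'm guessing it's just a matter of rewriting the limit file on a
live cgroup -- something like the following, again borrowing the "test"
cgroup from Ian's example; corrections welcome:)

cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes               # current limit
echo 536870912 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # raise to 512 MB on the fly
# As far as I can tell, lowering it below current usage makes the kernel
# try to reclaim first, and the write fails if it can't reclaim enough.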



On Thu, Jun 19, 2014 at 10:55 AM, Tim St Clair <ts...@redhat.com> wrote:

> Awesome response!
>
> inline below -
>
> ------------------------------
>
> *From: *"Sharma Podila" <sp...@netflix.com>
> *To: *user@mesos.apache.org
> *Cc: *"Ian Downes" <ia...@gmail.com>, "Eric Abbott" <
> eabbott@hubspot.com>
> *Sent: *Thursday, June 19, 2014 11:54:34 AM
>
> *Subject: *Re: cgroups memory isolation
>
> Purely from a user expectation point of view, I am wondering if such an
> "abuse" (overuse?) of I/O bandwidth/rate should translate into I/O
> bandwidth getting throttled for the job instead of it manifesting into an
> OOM that results in a job kill. Such I/O overuse translating into memory
> overuse seems like an implementation detail (for lack of a better phrase)
> of the OS that uses cache'ing. It's not like the job asked for its memory
> to be used up for I/O cache'ing :-)
>
> In cgroups, you could optionally specify the memory limit as soft, vs.
> hard (OOM).
>
>
>
> I do see that this isn't Mesos specific, but, rather a containerization
> artifact that is inevitable in a shared resource environment.
>
> That said, specifying memory size for jobs is not trivial in a shared
> resource environment. Conservative safe margins do help prevent OOMs, but,
> they also come with the side effect of fragmenting resources and reducing
> utilization. In some cases, they can cause job starvation to some extent,
> if most available memory is allocated to the conservative buffering for
> every job.
>
> Yup, unless you develop tuning models / hunting algorithms.  You need some
> level of global visibility & history.
>
> Another approach that could help, if feasible, is to have containers with
> elastic boundaries (different from over-subscription) that manage things
> such that sum of actual usage of all containers is <= system resources.
> This helps when not all jobs have peak use of resources simultaneously.
>
>
> You "could" use soft limits & resize, I like to call it the "push-over"
> policy.  If the limits are not enforced, what prevents abusive users in
> absence of global visibility?
>
> IMHO - having soft c-group memory limits being an option seems to be the
> right play given the environment.
>
> Thoughts?
>
>
>
> On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair <ts...@redhat.com> wrote:
>
>> FWIW -  There is classic grid mantra that applies here.  Test your
>> workflow on an upper bound, then over provision to be safe.
>>
>> Mesos is no different then SGE, PBS, LSF, Condor, etc.
>>
>> Also, there is no hunting algo for "jobs", that would have to live
>> outside of mesos itself, on some batch system built atop.
>>
>> Cheers,
>> Tim
>>
>> ------------------------------
>>
>> *From: *"Thomas Petr" <tp...@hubspot.com>
>> *To: *"Ian Downes" <ia...@gmail.com>
>> *Cc: *user@mesos.apache.org, "Eric Abbott" <ea...@hubspot.com>
>> *Sent: *Wednesday, June 18, 2014 9:36:51 AM
>> *Subject: *Re: cgroups memory isolation
>>
>>
>> Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
>> kernel.
>>
>> I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
>> some weird results. I initially gave the task 256 MB, and it never exceeded
>> the memory allocation (I killed the task manually after 5 minutes when the
>> file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
>> tried again. It exceeded memory
>> <https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82> almost
>> immediately. The next (replacement) task our framework started ran
>> successfully and never exceeded memory. I watched nr_dirty and it
>> fluctuated between 10000 to 14000 when the task is running. The slave host
>> is a c3.xlarge in EC2, if it makes a difference.
>>
>> As Mesos users, we'd like an isolation strategy that isn't affected by
>> cache this much -- it makes it harder for us to appropriately size things.
>> Is it possible through Mesos or cgroups itself to make the page cache not
>> count towards the total memory consumption? If the answer is no, do you
>> think it'd be worth looking at using Docker for isolation instead?
>>
>> -Tom
>>
>>
>> On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <ia...@gmail.com> wrote:
>>
>>> Hello Thomas,
>>>
>>> Your impression is mostly correct: the kernel will *try* to reclaim
>>> memory by writing out dirty pages before killing processes in a cgroup
>>> but if it's unable to reclaim sufficient pages within some interval (I
>>> don't recall this off-hand) then it will start killing things.
>>>
>>> We observed this on a 3.4 kernel where we could overwhelm the disk
>>> subsystem and trigger an oom. Just how quickly this happens depends on
>>> how fast you're writing compared to how fast your disk subsystem can
>>> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
>>> contained in a memory cgroup will fill the cache quickly, reach its
>>> limit and get oom'ed. We were not able to reproduce this under 3.10
>>> and 3.11 kernels. Which kernel are you using?
>>>
>>> Example: under 3.4:
>>>
>>> [idownes@hostname tmp]$ cat /proc/self/cgroup
>>> 6:perf_event:/
>>> 4:memory:/test
>>> 3:freezer:/
>>> 2:cpuacct:/
>>> 1:cpu:/
>>> [idownes@hostname tmp]$ cat
>>> /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
>>> 134217728
>>> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
>>> Killed
>>> [idownes@hostname tmp]$ ls -lah lotsazeros
>>> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>>>
>>>
>>> You can also look in /proc/vmstat at nr_dirty to see how many dirty
>>> pages there are (system wide). If you wrote at a rate sustainable by
>>> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
>>> (use something like watch) as the cgroup approached its limit and the
>>> kernel flushed dirty pages to bring it down.
>>>
>>> This might be an interesting read:
>>>
>>> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>>>
>>> Hope this helps! Please do let us know if you're seeing this on a
>>> kernel >= 3.10, otherwise it's likely this is a kernel issue rather
>>> than something with Mesos.
>>>
>>> Thanks,
>>> Ian
>>>
>>>
>>> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <tp...@hubspot.com> wrote:
>>> > Hello,
>>> >
>>> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
>>> > situations where lots of file I/O causes tasks to be killed due to
>>> exceeding
>>> > memory limits. Here's an example:
>>> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>>> >
>>> > We were under the impression that if cache was using a lot of memory it
>>> > would be reclaimed *before* the OOM process decides to kill the task.
>>> Is
>>> > this accurate? We also found MESOS-762 while trying to diagnose --
>>> could
>>> > this be a regression?
>>> >
>>> > Thanks,
>>> > Tom
>>>
>>
>>
>>
>>
>> --
>> Cheers,
>> Tim
>> Freedom, Features, Friends, First -> Fedora
>> https://fedoraproject.org/wiki/SIGs/bigdata
>>
>
>
>
>
> --
> Cheers,
> Tim
> Freedom, Features, Friends, First -> Fedora
> https://fedoraproject.org/wiki/SIGs/bigdata
>

Re: cgroups memory isolation

Posted by Tim St Clair <ts...@redhat.com>.
Awesome response! 

inline below - 

----- Original Message -----

> From: "Sharma Podila" <sp...@netflix.com>
> To: user@mesos.apache.org
> Cc: "Ian Downes" <ia...@gmail.com>, "Eric Abbott" <ea...@hubspot.com>
> Sent: Thursday, June 19, 2014 11:54:34 AM
> Subject: Re: cgroups memory isolation

> Purely from a user expectation point of view, I am wondering if such an
> "abuse" (overuse?) of I/O bandwidth/rate should translate into I/O bandwidth
> getting throttled for the job instead of it manifesting into an OOM that
> results in a job kill. Such I/O overuse translating into memory overuse
> seems like an implementation detail (for lack of a better phrase) of the OS
> that uses cache'ing. It's not like the job asked for its memory to be used
> up for I/O cache'ing :-)

In cgroups, you can optionally specify the memory limit as soft vs. hard (OOM). 

> I do see that this isn't Mesos specific, but, rather a containerization
> artifact that is inevitable in a shared resource environment.

> That said, specifying memory size for jobs is not trivial in a shared
> resource environment. Conservative safe margins do help prevent OOMs, but,
> they also come with the side effect of fragmenting resources and reducing
> utilization. In some cases, they can cause job starvation to some extent, if
> most available memory is allocated to the conservative buffering for every
> job.

Yup, unless you develop tuning models / hunting algorithms. You need some level of global visibility & history. 

> Another approach that could help, if feasible, is to have containers with
> elastic boundaries (different from over-subscription) that manage things
> such that sum of actual usage of all containers is <= system resources. This
> helps when not all jobs have peak use of resources simultaneously.

You "could" use soft limits & resize, I like to call it the "push-over" policy. If the limits are not enforced, what prevents abusive users in absence of global visibility? 

IMHO - having soft c-group memory limits being an option seems to be the right play given the environment. 

Thoughts? 

> On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair < tstclair@redhat.com > wrote:

> > FWIW - There is classic grid mantra that applies here. Test your
> > workflow on an upper bound, then over provision to be safe.
> >
> > Mesos is no different then SGE, PBS, LSF, Condor, etc.
> >
> > Also, there is no hunting algo for "jobs", that would have to live
> > outside of mesos itself, on some batch system built atop.
> >
> > Cheers,
> > Tim
> >
> > > From: "Thomas Petr" < tpetr@hubspot.com >
> > > To: "Ian Downes" < ian.downes@gmail.com >
> > > Cc: user@mesos.apache.org , "Eric Abbott" < eabbott@hubspot.com >
> > > Sent: Wednesday, June 18, 2014 9:36:51 AM
> > > Subject: Re: cgroups memory isolation
> > >
> > > Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> > > kernel.
> > >
> > > I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> > > some weird results. I initially gave the task 256 MB, and it never
> > > exceeded the memory allocation (I killed the task manually after 5
> > > minutes when the file hit 50 GB). Then I noticed your example was 128
> > > MB, so I resized and tried again. It exceeded memory almost
> > > immediately. The next (replacement) task our framework started ran
> > > successfully and never exceeded memory. I watched nr_dirty and it
> > > fluctuated between 10000 to 14000 when the task is running. The slave
> > > host is a c3.xlarge in EC2, if it makes a difference.
> > >
> > > As Mesos users, we'd like an isolation strategy that isn't affected by
> > > cache this much -- it makes it harder for us to appropriately size
> > > things. Is it possible through Mesos or cgroups itself to make the page
> > > cache not count towards the total memory consumption? If the answer is
> > > no, do you think it'd be worth looking at using Docker for isolation
> > > instead?
> > >
> > > - Tom
> > >
> > > On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes < ian.downes@gmail.com > wrote:
> > >
> > > > Hello Thomas,
> > > >
> > > > Your impression is mostly correct: the kernel will *try* to reclaim
> > > > memory by writing out dirty pages before killing processes in a cgroup
> > > > but if it's unable to reclaim sufficient pages within some interval (I
> > > > don't recall this off-hand) then it will start killing things.
> > > >
> > > > We observed this on a 3.4 kernel where we could overwhelm the disk
> > > > subsystem and trigger an oom. Just how quickly this happens depends on
> > > > how fast you're writing compared to how fast your disk subsystem can
> > > > write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
> > > > contained in a memory cgroup will fill the cache quickly, reach its
> > > > limit and get oom'ed. We were not able to reproduce this under 3.10
> > > > and 3.11 kernels. Which kernel are you using?
> > > >
> > > > Example: under 3.4:
> > > >
> > > > [idownes@hostname tmp]$ cat /proc/self/cgroup
> > > > 6:perf_event:/
> > > > 4:memory:/test
> > > > 3:freezer:/
> > > > 2:cpuacct:/
> > > > 1:cpu:/
> > > > [idownes@hostname tmp]$ cat
> > > > /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
> > > > 134217728
> > > > [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
> > > > Killed
> > > > [idownes@hostname tmp]$ ls -lah lotsazeros
> > > > -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
> > > >
> > > > You can also look in /proc/vmstat at nr_dirty to see how many dirty
> > > > pages there are (system wide). If you wrote at a rate sustainable by
> > > > your disk subsystem then you would see a sawtooth pattern _/|_/| ...
> > > > (use something like watch) as the cgroup approached its limit and the
> > > > kernel flushed dirty pages to bring it down.
> > > >
> > > > This might be an interesting read:
> > > > http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
> > > >
> > > > Hope this helps! Please do let us know if you're seeing this on a
> > > > kernel >= 3.10, otherwise it's likely this is a kernel issue rather
> > > > than something with Mesos.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr < tpetr@hubspot.com > wrote:
> > > > > Hello,
> > > > >
> > > > > We're running Mesos 0.18.0 with cgroups isolation, and have run into
> > > > > situations where lots of file I/O causes tasks to be killed due to
> > > > > exceeding memory limits. Here's an example:
> > > > > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
> > > > >
> > > > > We were under the impression that if cache was using a lot of memory
> > > > > it would be reclaimed *before* the OOM process decides to kill the
> > > > > task. Is this accurate? We also found MESOS-762 while trying to
> > > > > diagnose -- could this be a regression?
> > > > >
> > > > > Thanks,
> > > > > Tom
> >
> > --
> > Cheers,
> > Tim
> > Freedom, Features, Friends, First -> Fedora
> > https://fedoraproject.org/wiki/SIGs/bigdata

-- 
Cheers, 
Tim 
Freedom, Features, Friends, First -> Fedora 
https://fedoraproject.org/wiki/SIGs/bigdata 

Re: cgroups memory isolation

Posted by Sharma Podila <sp...@netflix.com>.
Purely from a user expectation point of view, I am wondering if such an
"abuse" (overuse?) of I/O bandwidth/rate should translate into I/O
bandwidth getting throttled for the job instead of manifesting as an OOM
that results in a job kill. Such I/O overuse translating into memory
overuse seems like an implementation detail (for lack of a better phrase)
of the OS's use of caching. It's not like the job asked for its memory to
be used up for I/O caching :-)

I do see that this isn't Mesos specific, but rather a containerization
artifact that is inevitable in a shared resource environment.

That said, specifying memory size for jobs is not trivial in a shared
resource environment. Conservative safe margins do help prevent OOMs, but
they also come with the side effect of fragmenting resources and reducing
utilization. In some cases, they can cause job starvation to some extent,
if most available memory is allocated to the conservative buffering for
every job.
Another approach that could help, if feasible, is to have containers with
elastic boundaries (different from over-subscription) that manage things
such that sum of actual usage of all containers is <= system resources.
This helps when not all jobs have peak use of resources simultaneously.


On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair <ts...@redhat.com> wrote:

> FWIW -  There is classic grid mantra that applies here.  Test your
> workflow on an upper bound, then over provision to be safe.
>
> Mesos is no different then SGE, PBS, LSF, Condor, etc.
>
> Also, there is no hunting algo for "jobs", that would have to live outside
> of mesos itself, on some batch system built atop.
>
> Cheers,
> Tim
>
> ------------------------------
>
> *From: *"Thomas Petr" <tp...@hubspot.com>
> *To: *"Ian Downes" <ia...@gmail.com>
> *Cc: *user@mesos.apache.org, "Eric Abbott" <ea...@hubspot.com>
> *Sent: *Wednesday, June 18, 2014 9:36:51 AM
> *Subject: *Re: cgroups memory isolation
>
>
> Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
> kernel.
>
> I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
> some weird results. I initially gave the task 256 MB, and it never exceeded
> the memory allocation (I killed the task manually after 5 minutes when the
> file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
> tried again. It exceeded memory
> <https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82> almost
> immediately. The next (replacement) task our framework started ran
> successfully and never exceeded memory. I watched nr_dirty and it
> fluctuated between 10000 to 14000 when the task is running. The slave host
> is a c3.xlarge in EC2, if it makes a difference.
>
> As Mesos users, we'd like an isolation strategy that isn't affected by
> cache this much -- it makes it harder for us to appropriately size things.
> Is it possible through Mesos or cgroups itself to make the page cache not
> count towards the total memory consumption? If the answer is no, do you
> think it'd be worth looking at using Docker for isolation instead?
>
> -Tom
>
>
> On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <ia...@gmail.com> wrote:
>
>> Hello Thomas,
>>
>> Your impression is mostly correct: the kernel will *try* to reclaim
>> memory by writing out dirty pages before killing processes in a cgroup
>> but if it's unable to reclaim sufficient pages within some interval (I
>> don't recall this off-hand) then it will start killing things.
>>
>> We observed this on a 3.4 kernel where we could overwhelm the disk
>> subsystem and trigger an oom. Just how quickly this happens depends on
>> how fast you're writing compared to how fast your disk subsystem can
>> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
>> contained in a memory cgroup will fill the cache quickly, reach its
>> limit and get oom'ed. We were not able to reproduce this under 3.10
>> and 3.11 kernels. Which kernel are you using?
>>
>> Example: under 3.4:
>>
>> [idownes@hostname tmp]$ cat /proc/self/cgroup
>> 6:perf_event:/
>> 4:memory:/test
>> 3:freezer:/
>> 2:cpuacct:/
>> 1:cpu:/
>> [idownes@hostname tmp]$ cat
>> /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
>> 134217728
>> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
>> Killed
>> [idownes@hostname tmp]$ ls -lah lotsazeros
>> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>>
>>
>> You can also look in /proc/vmstat at nr_dirty to see how many dirty
>> pages there are (system wide). If you wrote at a rate sustainable by
>> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
>> (use something like watch) as the cgroup approached its limit and the
>> kernel flushed dirty pages to bring it down.
>>
>> This might be an interesting read:
>>
>> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>>
>> Hope this helps! Please do let us know if you're seeing this on a
>> kernel >= 3.10, otherwise it's likely this is a kernel issue rather
>> than something with Mesos.
>>
>> Thanks,
>> Ian
>>
>>
>> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <tp...@hubspot.com> wrote:
>> > Hello,
>> >
>> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
>> > situations where lots of file I/O causes tasks to be killed due to
>> exceeding
>> > memory limits. Here's an example:
>> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>> >
>> > We were under the impression that if cache was using a lot of memory it
>> > would be reclaimed *before* the OOM process decides to kill the task.
>> Is
>> > this accurate? We also found MESOS-762 while trying to diagnose -- could
>> > this be a regression?
>> >
>> > Thanks,
>> > Tom
>>
>
>
>
>
> --
> Cheers,
> Tim
> Freedom, Features, Friends, First -> Fedora
> https://fedoraproject.org/wiki/SIGs/bigdata
>

Re: cgroups memory isolation

Posted by Tim St Clair <ts...@redhat.com>.
FWIW - There is a classic grid mantra that applies here. Test your workflow on an upper bound, then over-provision to be safe. 

Mesos is no different than SGE, PBS, LSF, Condor, etc. 
Also, there is no hunting algo for "jobs"; that would have to live outside of mesos itself, on some batch system built atop. 

Cheers, 
Tim 

----- Original Message -----

> From: "Thomas Petr" <tp...@hubspot.com>
> To: "Ian Downes" <ia...@gmail.com>
> Cc: user@mesos.apache.org, "Eric Abbott" <ea...@hubspot.com>
> Sent: Wednesday, June 18, 2014 9:36:51 AM
> Subject: Re: cgroups memory isolation

> Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32 kernel.

> I ran ` dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got some
> weird results. I initially gave the task 256 MB, and it never exceeded the
> memory allocation (I killed the task manually after 5 minutes when the file
> hit 50 GB). Then I noticed your example was 128 MB, so I resized and tried
> again. It exceeded memory almost immediately. The next (replacement) task
> our framework started ran successfully and never exceeded memory. I watched
> nr_dirty and it fluctuated between 10000 to 14000 when the task is running.
> The slave host is a c3.xlarge in EC2, if it makes a difference.

> As Mesos users, we'd like an isolation strategy that isn't affected by cache
> this much -- it makes it harder for us to appropriately size things. Is it
> possible through Mesos or cgroups itself to make the page cache not count
> towards the total memory consumption? If the answer is no, do you think it'd
> be worth looking at using Docker for isolation instead?

> - Tom

> On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes < ian.downes@gmail.com > wrote:

> > Hello Thomas,
> >
> > Your impression is mostly correct: the kernel will *try* to reclaim
> > memory by writing out dirty pages before killing processes in a cgroup
> > but if it's unable to reclaim sufficient pages within some interval (I
> > don't recall this off-hand) then it will start killing things.
> >
> > We observed this on a 3.4 kernel where we could overwhelm the disk
> > subsystem and trigger an oom. Just how quickly this happens depends on
> > how fast you're writing compared to how fast your disk subsystem can
> > write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
> > contained in a memory cgroup will fill the cache quickly, reach its
> > limit and get oom'ed. We were not able to reproduce this under 3.10
> > and 3.11 kernels. Which kernel are you using?
> >
> > Example: under 3.4:
> >
> > [idownes@hostname tmp]$ cat /proc/self/cgroup
> > 6:perf_event:/
> > 4:memory:/test
> > 3:freezer:/
> > 2:cpuacct:/
> > 1:cpu:/
> > [idownes@hostname tmp]$ cat
> > /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
> > 134217728
> > [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
> > Killed
> > [idownes@hostname tmp]$ ls -lah lotsazeros
> > -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
> >
> > You can also look in /proc/vmstat at nr_dirty to see how many dirty
> > pages there are (system wide). If you wrote at a rate sustainable by
> > your disk subsystem then you would see a sawtooth pattern _/|_/| ...
> > (use something like watch) as the cgroup approached its limit and the
> > kernel flushed dirty pages to bring it down.
> >
> > This might be an interesting read:
> > http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
> >
> > Hope this helps! Please do let us know if you're seeing this on a
> > kernel >= 3.10, otherwise it's likely this is a kernel issue rather
> > than something with Mesos.
> >
> > Thanks,
> > Ian
> >
> > On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr < tpetr@hubspot.com > wrote:
> > > Hello,
> > >
> > > We're running Mesos 0.18.0 with cgroups isolation, and have run into
> > > situations where lots of file I/O causes tasks to be killed due to
> > > exceeding memory limits. Here's an example:
> > > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
> > >
> > > We were under the impression that if cache was using a lot of memory it
> > > would be reclaimed *before* the OOM process decides to kill the task. Is
> > > this accurate? We also found MESOS-762 while trying to diagnose -- could
> > > this be a regression?
> > >
> > > Thanks,
> > > Tom

-- 
Cheers, 
Tim 
Freedom, Features, Friends, First -> Fedora 
https://fedoraproject.org/wiki/SIGs/bigdata 

Re: cgroups memory isolation

Posted by Thomas Petr <tp...@hubspot.com>.
Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32 kernel.

I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got some
weird results. I initially gave the task 256 MB, and it never exceeded the
memory allocation (I killed the task manually after 5 minutes when the file
hit 50 GB). Then I noticed your example was 128 MB, so I resized and tried
again. It exceeded memory
<https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82> almost
immediately. The next (replacement) task our framework started ran
successfully and never exceeded memory. I watched nr_dirty and it
fluctuated between 10000 and 14000 while the task was running. The slave host
is a c3.xlarge in EC2, if it makes a difference.

As Mesos users, we'd like an isolation strategy that isn't affected by
cache this much -- it makes it harder for us to appropriately size things.
Is it possible through Mesos or cgroups itself to make the page cache not
count towards the total memory consumption? If the answer is no, do you
think it'd be worth looking at using Docker for isolation instead?
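
(For what it's worth, the per-cgroup accounting does at least break the
charge out into cache vs. rss, so we can see how much of the "memory" is
really page cache -- something like the command below, where the cgroup
path is whatever the slave created for the executor; I'm guessing at the
exact location:)

grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/<executor-cgroup>/memory.stat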

-Tom


On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes <ia...@gmail.com> wrote:

> Hello Thomas,
>
> Your impression is mostly correct: the kernel will *try* to reclaim
> memory by writing out dirty pages before killing processes in a cgroup
> but if it's unable to reclaim sufficient pages within some interval (I
> don't recall this off-hand) then it will start killing things.
>
> We observed this on a 3.4 kernel where we could overwhelm the disk
> subsystem and trigger an oom. Just how quickly this happens depends on
> how fast you're writing compared to how fast your disk subsystem can
> write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
> contained in a memory cgroup will fill the cache quickly, reach its
> limit and get oom'ed. We were not able to reproduce this under 3.10
> and 3.11 kernels. Which kernel are you using?
>
> Example: under 3.4:
>
> [idownes@hostname tmp]$ cat /proc/self/cgroup
> 6:perf_event:/
> 4:memory:/test
> 3:freezer:/
> 2:cpuacct:/
> 1:cpu:/
> [idownes@hostname tmp]$ cat
> /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
> 134217728
> [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
> Killed
> [idownes@hostname tmp]$ ls -lah lotsazeros
> -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
>
>
> You can also look in /proc/vmstat at nr_dirty to see how many dirty
> pages there are (system wide). If you wrote at a rate sustainable by
> your disk subsystem then you would see a sawtooth pattern _/|_/| ...
> (use something like watch) as the cgroup approached its limit and the
> kernel flushed dirty pages to bring it down.
>
> This might be an interesting read:
>
> http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>
> Hope this helps! Please do let us know if you're seeing this on a
> kernel >= 3.10, otherwise it's likely this is a kernel issue rather
> than something with Mesos.
>
> Thanks,
> Ian
>
>
> On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <tp...@hubspot.com> wrote:
> > Hello,
> >
> > We're running Mesos 0.18.0 with cgroups isolation, and have run into
> > situations where lots of file I/O causes tasks to be killed due to
> exceeding
> > memory limits. Here's an example:
> > https://gist.github.com/tpetr/ce5d80a0de9f713765f0
> >
> > We were under the impression that if cache was using a lot of memory it
> > would be reclaimed *before* the OOM process decides to kill the task. Is
> > this accurate? We also found MESOS-762 while trying to diagnose -- could
> > this be a regression?
> >
> > Thanks,
> > Tom
>

Re: cgroups memory isolation

Posted by Ian Downes <ia...@gmail.com>.
Hello Thomas,

Your impression is mostly correct: the kernel will *try* to reclaim
memory by writing out dirty pages before killing processes in a cgroup
but if it's unable to reclaim sufficient pages within some interval (I
don't recall this off-hand) then it will start killing things.

We observed this on a 3.4 kernel where we could overwhelm the disk
subsystem and trigger an oom. Just how quickly this happens depends on
how fast you're writing compared to how fast your disk subsystem can
write it out. A simple "dd if=/dev/zero of=lotsazeros bs=1M" when
contained in a memory cgroup will fill the cache quickly, reach its
limit and get oom'ed. We were not able to reproduce this under 3.10
and 3.11 kernels. Which kernel are you using?

Example: under 3.4:

[idownes@hostname tmp]$ cat /proc/self/cgroup
6:perf_event:/
4:memory:/test
3:freezer:/
2:cpuacct:/
1:cpu:/
[idownes@hostname tmp]$ cat
/sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
134217728
[idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
Killed
[idownes@hostname tmp]$ ls -lah lotsazeros
-rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros


You can also look in /proc/vmstat at nr_dirty to see how many dirty
pages there are (system wide). If you wrote at a rate sustainable by
your disk subsystem then you would see a sawtooth pattern _/|_/| ...
(use something like watch) as the cgroup approached its limit and the
kernel flushed dirty pages to bring it down.
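
Something like this works for keeping an eye on it (nr_dirty and
nr_writeback are both plain counters in /proc/vmstat):

watch -n1 'grep -E "nr_dirty |nr_writeback " /proc/vmstat'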

This might be an interesting read:
http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
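
(The knobs it discusses are the usual vm.dirty_* sysctls; purely as an
illustration, you can see the current system-wide writeback thresholds
with:

sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs

Tuning those is a host-wide change though, not per-container.)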

Hope this helps! Please do let us know if you're seeing this on a
kernel >= 3.10, otherwise it's likely this is a kernel issue rather
than something with Mesos.

Thanks,
Ian


On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr <tp...@hubspot.com> wrote:
> Hello,
>
> We're running Mesos 0.18.0 with cgroups isolation, and have run into
> situations where lots of file I/O causes tasks to be killed due to exceeding
> memory limits. Here's an example:
> https://gist.github.com/tpetr/ce5d80a0de9f713765f0
>
> We were under the impression that if cache was using a lot of memory it
> would be reclaimed *before* the OOM process decides to kill the task. Is
> this accurate? We also found MESOS-762 while trying to diagnose -- could
> this be a regression?
>
> Thanks,
> Tom