You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Thomas Marshall <tw...@gmail.com> on 2013/05/07 00:34:19 UTC

Re: Review Request: Properly cleanup cgroups tests


> On April 30, 2013, 5:33 p.m., Ben Mahler wrote:
> > src/linux/cgroups.hpp, lines 100-107
> > <https://reviews.apache.org/r/10808/diff/1/?file=285105#file285105line100>
> >
> >     Hm.. I'm not sure I like this new API call. My understanding was that subsystems are "attached" to hierarchies, rather than cgroups.
> >     
> >     The comment on SubsystemInfo.cgroups is:
> >     int cgroups; // Number of cgroups for the subsystem.
> >     
> >     Could you not get the hierarch(y|ies) with the subsystem(s) mounted using cgroups::hierarchy(subsystems), then count the cgroups with cgroups::get(hierarchy)?

I updated the comment to reflect that yes, in cgroups parlance subsystems are 'attached' to hierarchies while cgroups 'associate' subsystems with tasks.

The reason using cgroups::get doesn't work is that it will tell you that the cgroups doesn't exist, since at this point it has been removed from the filesystem. I'm open to suggestions for a better way to implement this API call, though, such as exposing the SubsystemInfo struct and having a call that returns all of the SubsystemInfos 


> On April 30, 2013, 5:33 p.m., Ben Mahler wrote:
> > src/tests/cgroups_tests.cpp, lines 162-175
> > <https://reviews.apache.org/r/10808/diff/1/?file=285107#file285107line162>
> >
> >     I wish there was a simpler way to do this. Do you think this is a kernel bug? (Note that we haven't encountered this on CentOS).
> >     
> >     If so, and it occurs on kernels we want to support, is this only an issue during the tests, or could there be issues in a production environment?

I haven't been able to reproduce this on Linux kernel version 2.6, 3.2, or 3.8, only 3.5, so it does seem to be a kernel bug. Unfortunately, 3.5 happens to be the kernel version that ships with the latest LTS release of Ubuntu, which we in the AMP lab were hoping to support along with CentOS 6.4. 

I believe that this is extremely unlikely to come up in production because it is related to unmounting hierarchies, and the CgroupsIsolator only mounts one hierarchy for each slave, so unless you killed a slave and then immediately brought up another one on the same machine, there shouldn't be a problem (and if you're killing the slave manually, there's no way you'll launch the new one fast enough).

As far as other ways to make this work, I looked into the possibility of, instead of waiting for the cleanup, just retrying the mount some number of times, but there's no way to really make this work for the NoHierarchy test, since it doesn't do any mounting in SetUp.

The simplest alternative would be to add an os::sleep(Seconds(0.1)) to the end of TearDown, which is a little less hack-y in that I haven't been able to find any documentation saying that the write is /proc/cgroups is a logical event to wait on, but it'll slow the tests down, and affects systems where this bug isn't present.


- Thomas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10808/#review19961
-----------------------------------------------------------


On April 26, 2013, 10:47 p.m., Thomas Marshall wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10808/
> -----------------------------------------------------------
> 
> (Updated April 26, 2013, 10:47 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Vinod Kone, and Ben Mahler.
> 
> 
> Description
> -------
> 
> Previously, our the CgroupsAnyHierarchy*Tests were not waiting long enough for the cgroups created to be cleaned up before the test finished, resulting in failures when the next test was unable to create cgroups for itself. This patch attempts to fix that by having CgroupsAnyHierarchy::TearDown wait until the number of cgroups each subsystem is attached to, documented in /proc/cgroups, is the same as it was before the test ran.
> 
> 
> This addresses bug MESOS-449.
>     https://issues.apache.org/jira/browse/MESOS-449
> 
> 
> Diffs
> -----
> 
>   src/linux/cgroups.hpp 3e86828 
>   src/linux/cgroups.cpp 139c3ef 
>   src/tests/cgroups_tests.cpp 3734315 
> 
> Diff: https://reviews.apache.org/r/10808/diff/
> 
> 
> Testing
> -------
> 
> On both Ubuntu and CentOS, on both clean installs and in the presence of pre-existing cgroups on the system (such as those left behind by BalloonFramework failing):
> 
> bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=1000 --gtest_filter=*CgroupsAny*
> 
> 
> Thanks,
> 
> Thomas Marshall
> 
>