Posted to dev@mesos.apache.org by Erik Weathers <ew...@groupon.com.INVALID> on 2015/12/31 03:55:42 UTC

tasks not being scheduled; cfs_rq for /mesos is missing

I'm trying to figure out a situation where we see tasks in a mesos
container no longer being scheduled by the Linux kernel.  None of the tasks
in the container are zombies, nor are they stuck in "Disk sleep" state.
They are all in Running state.  But if I try to strace the processes the
strace cmd just hangs.  I've also noticed that none of the RIPs (64-bit
instruction pointers) are changing at all in these tasks, and they're not
accumulating any cputime.   So the kernel is just not scheduling them.
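
One way to confirm the "not accumulating any cputime" observation without strace is to sample utime + stime from /proc/<pid>/stat twice. The sketch below is only an illustration: the helper names cpu_ticks and is_accumulating are made up, and the check is only meaningful for tasks that report state R, since a healthy sleeping task also gains no CPU time.

```shell
# cpu_ticks: read one /proc/<pid>/stat line on stdin and print utime + stime
# (fields 14 and 15, in clock ticks). Everything up to the last ") " is
# stripped first, because the command name in field 2 may contain spaces.
cpu_ticks() {
    sed 's/^.*) //' | awk '{ print $12 + $13 }'
}

# is_accumulating PID [SECONDS]: hypothetical helper that succeeds if PID
# gained CPU time over the interval (default 5 seconds).
is_accumulating() {
    before=$(cpu_ticks < "/proc/$1/stat") || return 2
    sleep "${2:-5}"
    after=$(cpu_ticks < "/proc/$1/stat") || return 2
    [ "$after" -gt "$before" ]
}
```

For example, `is_accumulating 6473 10 || echo "no CPU time gained"` would flag a task that reports Running but made no progress in 10 seconds.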

Despite the behavior described above, these non-running tasks *are* listed
in the run queues of /proc/sched_debug.  Notably, on hosts without this
problem there exist "cfs_rq[N]:/mesos" run queues, but on the hosts with
the broken scheduling these run queues don't exist, even though
"cfs_rq[N]:/mesos/<cgroup-UUID>" is still present in /proc/sched_debug.
That is mighty suspicious to me.

I'm curious about:

   - Has anyone seen similar behavior?
   - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
   /foo/bar tasks from being scheduled?  i.e., might that be the root cause of
   why the kernel is ignoring these tasks?
   - What creates the /mesos cfs run queue, and why would that cease to
   exist without the subordinate cgroups being cleaned up?
      - I'm assuming the creation of the "cpu" cgroup with the path
      "/mesos" done by mesos-slave creates this run queue.
      - But I'm not sure how/why it would be removed, since I still see a
      mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos exists).

I'm assuming that this is a kernel bug, and I'm hopeful RedHat has patched
fixes into newer kernel versions that we are running on other hosts (e.g.,
2.6.32-573.7.1.el6).

Setup info:

Kernel version:  2.6.32-431.el6.x86_64
Mesos version:  0.22.1
Containerizer: Mesos
Isolators: Have seen this behavior with both of these configs:
   cgroups/cpu,cgroups/mem
   cgroups/cpu,cgroups/mem,namespaces/pid

Thanks for any insight or help!

- Erik

Re: tasks not being scheduled; cfs_rq for /mesos is missing

Posted by Jojy Varghese <jo...@mesosphere.io>.

Hi Erik,
  Thanks again for the information. The value of 0 in the /cgroup/cpu/cpu.cfs_* files looks suspicious.

Since you are motivated to look into the kernel, here is what I think is interesting.

- The sched_debug print (http://lxr.free-electrons.com/source/kernel/sched_debug.c?v=2.6.34#L163) finds cgroup_subsys_state (http://lxr.free-electrons.com/ident?v=2.6.34;i=cgroup_subsys_state; http://lxr.free-electrons.com/source/include/linux/cgroup.h?v=2.6.34#L60) as NULL for the task group.
- So what makes css NULL for a task_group? Most likely css_put (http://lxr.free-electrons.com/source/kernel/cgroup.c?v=2.6.34#L4312) was called. It would be interesting to use kernel trace/debug/SystemTap to see if this happens. My bet is that “rmdir” was called on the cgroup, and that caused this.


-Jojy



> On Jan 4, 2016, at 7:54 PM, Erik Weathers <ew...@groupon.com.INVALID> wrote:
> 
> hi Jojy,
> 
> Unfortunately, I haven't been able to reproduce this issue on demand, it
> has just happened spontaneously a few times.   So I cannot say for sure if
> it would happen on a newer mesos/kernel version.  I'm thinking of trying to
> force reproduction by creating and destroying a ton of cgroups, since the
> issue does *seem* to possibly correlate with some badly behaved storm
> topologies that are constantly crashing and causing the cgroups to be
> created and destroyed often.
> 
> I have a couple test hosts that are in this bad state right now, so I'm
> trying to get as much info out of them as I can.  I'm thinking of trying
> SystemTap to introspect the kernel's run queue state and see what is
> happening.
> 
> Here is the info you requested:
> 
> */cgroup/cpu files:*
> 
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/$f; done | head -n20
> ----cpu.cfs_quota_us:----
> 0
> ----cpu.cfs_period_us:----
> 0
> ----cpu.shares:----
> 1024
> ----cpu.stat:----
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> ----tasks:----
> 1
> ...
> 
> */cgroup/cpu/mesos files:*
> 
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/mesos/$f; done
> ----cpu.cfs_quota_us:----
> -1
> ----cpu.cfs_period_us:----
> 100000
> ----cpu.shares:----
> 1024
> ----cpu.stat:----
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> ----tasks:----
> 
> NOTE: no tasks, and the cpu.cfs_quota_us being -1.  But those are both
> consistent with other hosts that aren't exhibiting this problem.
> 
> */cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e files:*
> 
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/$f; done
> ----cpu.cfs_quota_us:----
> 1800000
> ----cpu.cfs_period_us:----
> 100000
> ----cpu.shares:----
> 18432
> ----cpu.stat:----
> nr_periods 680868
> nr_throttled 254025
> throttled_time 55400010353125
> ----tasks:----
> 6473
> ...
> 
> - Erik

Re: tasks not being scheduled; cfs_rq for /mesos is missing

Posted by Erik Weathers <ew...@groupon.com.INVALID>.

hi Jojy,

Unfortunately, I haven't been able to reproduce this issue on demand; it
has just happened spontaneously a few times.  So I cannot say for sure
whether it would happen on a newer mesos/kernel version.  I'm thinking of
trying to force reproduction by creating and destroying a ton of cgroups,
since the issue does *seem* to correlate with some badly behaved Storm
topologies that are constantly crashing and causing the cgroups to be
created and destroyed often.
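
A rough sketch of such a churn loop follows. The function name churn_cgroups and the "churn-test-" directory prefix are made up for illustration, and it only creates and removes empty cgroup directories. Real containers also move tasks in and out, so it is an open question whether this alone would trigger the problem.

```shell
# churn_cgroups ROOT COUNT: create and immediately remove COUNT child
# directories under ROOT. On a cgroup filesystem, mkdir creates a cgroup
# and rmdir destroys it.
churn_cgroups() {
    root=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        dir="$root/churn-test-$i"
        mkdir "$dir" || return 1
        rmdir "$dir" || return 1
        i=$((i + 1))
    done
}
```

On a disposable test host this might be run as root with something like `churn_cgroups /cgroup/cpu/mesos 100000`, followed by checking /proc/sched_debug for the "/mesos" run queues.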

I have a couple test hosts that are in this bad state right now, so I'm
trying to get as much info out of them as I can.  I'm thinking of trying
SystemTap to introspect the kernel's run queue state and see what is
happening.

Here is the info you requested:

*/cgroup/cpu files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
echo ----$f:---- ; cat /cgroup/cpu/$f; done | head -n20
----cpu.cfs_quota_us:----
0
----cpu.cfs_period_us:----
0
----cpu.shares:----
1024
----cpu.stat:----
nr_periods 0
nr_throttled 0
throttled_time 0
----tasks:----
1
...

*/cgroup/cpu/mesos files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
echo ----$f:---- ; cat /cgroup/cpu/mesos/$f; done
----cpu.cfs_quota_us:----
-1
----cpu.cfs_period_us:----
100000
----cpu.shares:----
1024
----cpu.stat:----
nr_periods 0
nr_throttled 0
throttled_time 0
----tasks:----

NOTE: there are no tasks, and cpu.cfs_quota_us is -1.  But both are
consistent with other hosts that aren't exhibiting this problem.

*/cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
echo ----$f:---- ; cat /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/$f; done
----cpu.cfs_quota_us:----
1800000
----cpu.cfs_period_us:----
100000
----cpu.shares:----
18432
----cpu.stat:----
nr_periods 680868
nr_throttled 254025
throttled_time 55400010353125
----tasks:----
6473
...
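
For reading the container's numbers above: a quota of 1800000 us per 100000 us period is 18 CPUs' worth of bandwidth, and cpu.shares of 18432 is 18 x 1024. The cpu.stat counters can be summarized with a small helper like the sketch below. The function name throttle_summary is made up, and it assumes throttled_time is in nanoseconds.

```shell
# throttle_summary: read cpu.stat-formatted text on stdin and print the
# share of enforcement periods that were throttled, plus the total
# throttled time converted from nanoseconds to seconds.
throttle_summary() {
    awk '
        $1 == "nr_periods"     { periods = $2 }
        $1 == "nr_throttled"   { throttled = $2 }
        $1 == "throttled_time" { ns = $2 }
        END {
            if (periods > 0)
                printf "throttled in %.1f%% of periods\n", 100 * throttled / periods
            printf "total throttled time: %.0f s\n", ns / 1e9
        }
    '
}
```

Usage would be `throttle_summary < /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/cpu.stat`. With the values above it reports throttling in roughly 37% of periods.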

- Erik

On Sun, Jan 3, 2016 at 10:38 AM, Jojy Varghese <jo...@mesosphere.io> wrote:

> Hi Erik
>   Happy to work on this with you. Thanks for the details.
>
> As you might know, in cfs_rq:/<name> (from /proc/sched_debug), <name> is
> the CPU cgroup hierarchy name. I am curious about the contents and cgroups
> hierarchy when this happens. Could you send the “mesos” hierarchy
> (directory tree) and contents of files like
> ‘tasks’,’cpu.cfs_quota_us’,’cpu.cfs_period_us' ‘cpu.shares’,  ‘cpu.stat’.
>
> It does look strange that the parent cgroup is missing when child is
> present.
>
> Also, wondering if you are able to see same issue with latest Mesos and/or
> kernel?
>
> -Jojy

Re: tasks not being scheduled; cfs_rq for /mesos is missing

Posted by Jojy Varghese <jo...@mesosphere.io>.

Hi Erik
  Happy to work on this with you. Thanks for the details. 

As you might know, in cfs_rq:/<name> (from /proc/sched_debug), <name> is the CPU cgroup hierarchy name. I am curious about the contents and cgroup hierarchy when this happens. Could you send the “mesos” hierarchy (directory tree) and the contents of files like ‘tasks’, ‘cpu.cfs_quota_us’, ‘cpu.cfs_period_us’, ‘cpu.shares’, and ‘cpu.stat’?

It does look strange that the parent cgroup is missing when the child is present.

Also, are you able to see the same issue with the latest Mesos and/or kernel?

-Jojy
  

> On Jan 2, 2016, at 9:43 PM, Erik Weathers <ew...@groupon.com.INVALID> wrote:
> 
> hey Jojy,  Thanks for your reply.  Response inline.
> 
> On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <jo...@mesosphere.io> wrote:
> 
>>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>  /foo/bar tasks from being scheduled?  i.e., might that be the root
>> cause of
>>>  why the kernel is ignoring these tasks?
>> 
>> Was curious why you said the above. CPU scheduling shares are a function
>> of their parent’s CPU bandwidth.
>> 
> 
> This question arose from an earlier observation in my initial email:
> 
> In my initial email I pointed out that the contents of /proc/sched_debug
> list all of the CFS run queues, but it seems like some of those run queues
> are missing on the affected hosts.  i.e., usually they look like this (only
> including output for the 1st CPU's CFS run queues):
> 
> % grep 'cfs_rq\[0\]' /proc/sched_debug
> cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
> cfs_rq[0]:/mesos
> cfs_rq[0]:/
> 
> But on the problematic hosts, they look like this:
> 
> % grep 'cfs_rq\[0\]' /proc/sched_debug
> cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
> cfs_rq[0]:/
> 
> Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
> 
> I'm not sure how that is possible, given my understanding that these
> cfs_rq's are created from the special cgroups filesystem having directories
> added to it, and since the /cgroup/cpu/mesos dir exists (as well as
> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
> the CFS run queues for "/mesos" could have been deleted.   I've been trying
> to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
> 
> Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not suspicious.
> i.e., it's not that the cgroup settings of the "parent" /mesos cgroup are
> preventing the tasks from being scheduled.  It seems to be that the cgroup
> settings of the parent are simply gone from the kernel.  Poof.
> 
> At this point I'm assuming that the above observation is indeed the root
> cause of the problem, and I'm simply hoping that whatever logic deleted the
> "/mesos" run queue is fixed in either a newer kernel or newer mesos version.
> 
> Thanks!
> 
> - Erik

Re: tasks not being scheduled; cfs_rq for /mesos is missing

Posted by Erik Weathers <ew...@groupon.com.INVALID>.

hey Jojy,  Thanks for your reply.  Response inline.

On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <jo...@mesosphere.io> wrote:

> > Are /foo/bar cgroups hierarchical such that /foo missing would prevent
> >   /foo/bar tasks from being scheduled?  i.e., might that be the root
> cause of
> >   why the kernel is ignoring these tasks?
>
> Was curious why you said the above. CPU scheduling shares are a function
> of their parent’s CPU bandwidth.
>

This question arose from an observation in my initial email:

I pointed out that /proc/sched_debug lists all of the CFS run queues, but
that some of those run queues seem to be missing on the affected hosts.
Usually they look like this (only including output for the 1st CPU's CFS
run queues):

% grep 'cfs_rq\[0\]' /proc/sched_debug
cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
cfs_rq[0]:/mesos
cfs_rq[0]:/

But on the problematic hosts, they look like this:

% grep 'cfs_rq\[0\]' /proc/sched_debug
cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
cfs_rq[0]:/

Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
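
To scan a whole host for this condition rather than eyeballing the grep output, something like the following sketch could be used. The function name missing_parent_rqs is made up. It assumes each run queue header starts a line as "cfs_rq[N]:/path", as in the output above, and it reports every ancestor path that has no run queue of its own on the same CPU.

```shell
# missing_parent_rqs: read /proc/sched_debug text on stdin and print each
# "cfs_rq[N]:/ancestor" that is absent even though a descendant run queue
# exists on the same CPU. The root "/" is not checked.
missing_parent_rqs() {
    awk '
        $0 ~ "^cfs_rq\\[[0-9]+\\]:/" {
            sep = index($0, ":")
            n++
            cpu[n] = substr($0, 1, sep - 1)    # e.g. cfs_rq[0]
            path[n] = substr($0, sep + 1)      # e.g. /mesos/<uuid>
            seen[cpu[n], path[n]] = 1
        }
        END {
            for (i = 1; i <= n; i++) {
                p = path[i]
                # Strip the last path component until only the root is left.
                while (match(p, "/[^/]*$") && RSTART > 1) {
                    p = substr(p, 1, RSTART - 1)
                    if (!((cpu[i], p) in seen) && !((cpu[i], p) in reported)) {
                        reported[cpu[i], p] = 1
                        print cpu[i] ":" p
                    }
                }
            }
        }
    '
}
```

Running `missing_parent_rqs < /proc/sched_debug` prints nothing on a healthy host, and would print lines such as cfs_rq[0]:/mesos on a host in the state shown above.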

I'm not sure how that is possible, given my understanding that these
cfs_rq's are created from the special cgroups filesystem having directories
added to it, and since the /cgroup/cpu/mesos dir exists (as well as
/cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
the CFS run queues for "/mesos" could have been deleted.   I've been trying
to read the kernel cgroup CFS scheduling code, but it's tough for a newb.

Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
/cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not suspicious.
i.e., it's not that the cgroup settings of the "parent" /mesos cgroup are
preventing the tasks from being scheduled.  It seems that the parent's
cgroup settings are simply gone from the kernel.  Poof.

At this point I'm assuming that the above observation is indeed the root
cause of the problem, and I'm simply hoping that whatever logic deleted the
"/mesos" run queue is fixed in either a newer kernel or newer mesos version.

Thanks!

- Erik


Re: tasks not being scheduled; cfs_rq for /mesos is missing

Posted by Jojy Varghese <jo...@mesosphere.io>.

> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>   /foo/bar tasks from being scheduled?  i.e., might that be the root cause of
>   why the kernel is ignoring these tasks?

I was curious why you asked the above. A child cgroup's CPU scheduling shares are a function of its parent’s CPU bandwidth.
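
As a rough illustration of that dependency: under full contention, a cgroup's fraction of the CPU is approximately the product, over each level of the hierarchy, of its shares divided by the total shares of the runnable siblings at that level. The sketch below just does that arithmetic. The helper name effective_share and the sibling totals in the usage example are hypothetical.

```shell
# effective_share SHARES TOTAL [SHARES TOTAL ...]: multiply shares/total
# for each level of the hierarchy, listed from the top level down.
# TOTAL is the sum of cpu.shares over all runnable siblings at that level,
# including the group itself.
effective_share() {
    awk 'BEGIN {
        frac = 1
        for (i = 1; i + 1 < ARGC; i += 2)
            frac *= ARGV[i] / ARGV[i + 1]
        printf "%.4f\n", frac
    }' "$@"
}
```

For instance, if /mesos (1024 shares) competed with one other top-level group of 1024 shares, and a single container with 18432 shares were the only child of /mesos, then `effective_share 1024 2048 18432 18432` gives 0.5000. The child's slice is carved out of whatever the parent receives.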

-Jojy

