Posted to dev@mesos.apache.org by Mohit Soni <mo...@gmail.com> on 2014/03/11 03:58:27 UTC

Mesos, multiple frameworks and starvation

I was running a load test on a Mesos cluster, and observed that when Mesos is running lots of frameworks, offer starvation occurs for certain frameworks, i.e. only a subset of the frameworks registered with Mesos get offers. Let me describe the scenario below:

First phase:
At the beginning, there’s only one framework registered with Mesos, which is ‘Marathon’. The load generator uses Marathon’s API to launch, let’s say, 50 Jenkins masters with mesos-plugin installed. Once all 50 masters are launched, the Mesos cluster has 51 frameworks registered in total, because each mesos-plugin registers itself with mesos-master as a framework.

Second phase:
Now, the load generator goes and triggers a couple of build jobs on each Jenkins master. Each framework’s scheduler will now have, let’s say, 2 items in its build queue. Once a framework gets a resource offer from the Mesos master, its scheduler can run the build tasks, if the offer matches the resource constraints specified by mesos-plugin.

What I observed was that, at the start of the second phase, some frameworks (Jenkins masters) got offers and got their tasks scheduled to run. But the rest of the frameworks didn’t get resource offers from mesos-master, and the build jobs scheduled on those got starved. Tailing the Jenkins logs on these masters never showed ‘Received offers’. Also, according to the Mesos master logs, Mesos was sending offers to only a handful of frameworks. The log excerpt below is from within a single minute, but I saw similar behavior at other times. I have added a line break after each group of frameworks getting offers:

I0310 17:56:44.703126  1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364
I0310 17:56:45.722951  1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0371
I0310 17:56:46.744184  1159 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0377
I0310 17:56:47.768546  1158 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0380
I0310 17:56:48.794517  1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0396

I0310 17:56:49.813484  1157 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364
I0310 17:56:50.833155  1159 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0371
I0310 17:56:51.859712  1158 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0377
I0310 17:56:52.879678  1153 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0380
I0310 17:56:53.904261  1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0396

I0310 17:56:54.929472  1155 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364
I0310 17:56:55.947387  1153 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0371
I0310 17:56:56.975060  1157 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0377
I0310 17:56:57.996995  1159 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0380
I0310 17:56:59.022555  1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0396
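
To check this over a longer stretch of the master log, the offer lines can be tallied per framework. The following is a minimal Python sketch that assumes every relevant line has the "Sending N offers to framework ID" shape shown above; the function name tally_offers and the two-line sample are only for illustration.

```python
import re
from collections import Counter

# Pattern derived from the master log lines shown above (an assumption
# about the log format, not an official Mesos log specification).
OFFER_RE = re.compile(r"Sending (\d+) offers to framework (\S+)")


def tally_offers(lines):
    """Count how many offer batches each framework ID received."""
    counts = Counter()
    for line in lines:
        match = OFFER_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "I0310 17:56:44.703126  1156 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364",
    "I0310 17:56:49.813484  1157 master.cpp:2250] Sending 24 offers to framework 201403032301-1255541002-5050-1126-0364",
]

for framework_id, batches in tally_offers(sample).items():
    print(framework_id, batches)
```

Comparing the number of distinct framework IDs in the tally against the number of registered frameworks (51 in this test) shows how many frameworks never received an offer.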

A couple of questions:
1. Does running multiple frameworks (say more than 10) have an impact on the resource allocation strategy?
2. If a registered framework keeps declining Mesos offers for a while, does Mesos take that into account when sending offers?

Links:
1. https://github.com/mesosphere/marathon
2. https://github.com/jenkinsci/mesos-plugin

-- 
Mohit

Re: Mesos, multiple frameworks and starvation

Posted by Benjamin Hindman <be...@eecs.berkeley.edu>.
Hi Mohit,

The scenario makes sense and unfortunately this is really a bug in how we
do allocations.

The default allocator in Mesos implements a weighted fair-sharing algorithm
called dominant resource fairness. This does well when there are tasks that
are "short-lived" (or at least, don't live forever) and an adequate amount
of resources to "go around". In your case, it sounds like there is a lot
more computation (build jobs) than available resources.
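
For readers unfamiliar with dominant resource fairness, here is a minimal
Python sketch of the core idea. It is not the Mesos allocator's code: the
function names, framework names and resource numbers are made up for
illustration, and per-framework weights are ignored. A framework's
dominant share is its largest fractional share of any single resource,
and the framework with the lowest dominant share is served next.

```python
def dominant_share(allocated, total):
    """Largest fraction of any one resource that a framework holds."""
    return max(
        allocated.get(resource, 0) / amount
        for resource, amount in total.items()
        if amount > 0
    )


def next_framework(allocations, total):
    """Pick the framework with the lowest dominant share (unweighted)."""
    return min(
        allocations,
        key=lambda name: dominant_share(allocations[name], total),
    )


# Hypothetical cluster capacity and current allocations.
total = {"cpus": 9, "mem": 18}
allocations = {
    "A": {"cpus": 2, "mem": 8},  # dominant resource is mem: 8/18
    "B": {"cpus": 6, "mem": 2},  # dominant resource is cpus: 6/9
}

print(next_framework(allocations, total))
```

In this example A holds 8/18 of the memory and B holds 6/9 of the CPUs,
so A has the lower dominant share and would be offered resources first.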

I've filed https://issues.apache.org/jira/browse/MESOS-1086 and put up a
patch at https://reviews.apache.org/r/19090.

Thanks for the detailed report Mohit!

Ben.

