Posted to user@mesos.apache.org by Bernerd Schaefer <be...@soundcloud.com> on 2013/09/18 13:36:17 UTC

Service Scheduling in Mesos

I'm curious to learn what's been going on in Mesos (and the general
ecosystem) around service scheduling. In particular, I'm curious about
how Mesos might work in a cluster where service tasks are more common
than batch tasks, e.g., a cluster with a single framework for running
stateless tasks and many frameworks for running stateful tasks.

I haven't been able to find much information about how exactly service
scheduling fits with Mesos -- the dialogue is certainly skewed towards
ephemeral / batch scheduling at the moment. With that in mind, I've
tried to outline some topics I've been thinking about recently. What
I'm really curious to know is:

1. Am I way off track?
2. For a service scheduler built today, how much is Mesos responsible
   for and how much the framework? What about going forward?
3. Are there already some patterns/idioms for these kinds of things in
   existing frameworks?

# Balancing tasks within a framework

For this, imagine a framework that schedules long-lived (service), stateless
tasks.

- If asked to schedule a task with comparatively large resource
  requirements, the task may never get scheduled if it waits for a
  sufficiently large resource offer. Instead, it should attempt to
  reschedule existing tasks to "make room" for it. How might that work?

- If asked to schedule multiple copies of a task across different
  machines, some copies may never get scheduled if it waits for a
  sufficiently diverse set of resource offers. Instead, it should
  reschedule existing tasks to meet the availability requirements of
  the task. What might that look like?

Maybe both of these could be accomplished by using some combination of:

- using `requestResources` when large tasks are requested to try and
  get bigger offers.

- using saved offers to relaunch existing tasks, and then hoarding the
  freed resources for scheduling new tasks.
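To make the offer-hoarding idea concrete, here is a rough Python sketch.
Everything in it (`Offer`, `hoard_until_fit`) is an invented name for
illustration, not part of the Mesos API: the framework holds on to saved
offers until the resources accumulated on a single slave cover the
pending large task.

```python
# Illustrative sketch only: `Offer` and `hoard_until_fit` are made-up
# names, not Mesos API types. The idea: hoard saved offers until the
# resources accumulated on a single slave fit the pending task.
from dataclasses import dataclass

@dataclass
class Offer:
    slave_id: str
    mem: int  # MB

def hoard_until_fit(saved_offers, new_offer, task_mem):
    """Add the new offer to the hoard. If the hoarded offers on some
    slave now cover the task, return those offers (to launch with);
    otherwise return None and keep hoarding."""
    saved_offers.append(new_offer)
    per_slave = {}
    for offer in saved_offers:
        per_slave.setdefault(offer.slave_id, []).append(offer)
    for offers in per_slave.values():
        if sum(o.mem for o in offers) >= task_mem:
            return offers
    return None
```

A real framework would presumably also decline or release hoarded offers
after a timeout, so that other frameworks aren't starved.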

# Resource contention / balancing tasks across frameworks

For this, imagine there are two frameworks, one like above, running
stateless service tasks, the other responsible for a single stateful
task. Again, the cluster is relatively full.

- If the stateful scheduler wants to run its task on a particular
  machine, but that machine's resources are currently consumed by the
  other framework, what happens?

- If the stateful scheduler can run its task on any machine, but there
  exists no single offer sufficiently large to run the task, what does
  it do?

Some possible ways to approach this:

- The ability to request that other frameworks release their saved
  offers, as the resources may actually be available, but currently
  hoarded. I think `requestResources` on the scheduler might do this?

- The ability to request that other frameworks reschedule existing
  tasks. This could be a "user-land" feature? If I have a particular
  slave in mind to run my task and there is a way to find frameworks
  with tasks on that slave, I could randomly send some kind of
  "reschedule" message to one of the frameworks. This message might
  include the slave, my requested resources, and a priority understood
  by all of my frameworks. The other framework could then compare its
  priority with the message, and decide whether it should reschedule.
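A rough sketch of how the receiving framework might handle such a
"reschedule" message. All names here (`RescheduleRequest`,
`should_reschedule`) are invented for illustration -- nothing like this
exists in Mesos itself:

```python
# Hypothetical "user-land" reschedule protocol. A framework yields only
# to a strictly higher-priority request, and only if the tasks it holds
# on that slave would actually free enough resources.
from dataclasses import dataclass

@dataclass
class RescheduleRequest:
    slave_id: str
    mem: int       # requested MB on that slave
    priority: int  # priority understood by all cooperating frameworks

def should_reschedule(request, my_tasks, my_priority):
    """Decide whether this framework should move tasks off the slave
    named in the request."""
    if request.priority <= my_priority:
        return False  # don't yield to equal or lower priority
    freed = sum(t["mem"] for t in my_tasks
                if t["slave_id"] == request.slave_id)
    return freed >= request.mem
```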

Cheers,

Bernerd
Engineer @ SoundCloud

Re: Service Scheduling in Mesos

Posted by Bernerd <be...@soundcloud.com>.
> Bernerd,
> 
> You should really check out Marathon https://github.com/mesosphere/marathon
> This fits closely for what you've described ;)

Oh, I am! :)

Perhaps I can shorten, clarify, and generalize my underlying questions. 

Assume I have a single framework (say, marathon) and its tasks occupy all but 5G of ram on each slave.

1. What happens if I ask the framework to start a task that wants 10G? Would this be a concern solely of the framework? Or would Mesos be able to intervene? For example, as far as I can tell Marathon would be very unlikely to schedule this task, instead scheduling smaller tasks as resources become available.

2. What happens if I bring up a new framework which informs Mesos that it wants to schedule a 10G task? Will Mesos intervene (e.g., by killing some of the other framework's tasks)? Or will it only be scheduled after some number of tasks happen to finish?
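The arithmetic behind question 1 can be made explicit. Assuming the
usual rule that a task must fit inside a single resource offer (the
helper name below is made up for illustration):

```python
def fits_any_offer(offer_mems_gb, task_gb):
    """A task can launch only if some single offer can hold it; with
    5G free per slave, a 10G task never fits any offer."""
    return any(free >= task_gb for free in offer_mems_gb)
```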

Bernerd
Engineer @ SoundCloud

Re: Service Scheduling in Mesos

Posted by Paco Nathan <ce...@gmail.com>.
Bernerd,

You should really check out Marathon https://github.com/mesosphere/marathon
This fits closely for what you've described ;)




Re: Service Scheduling in Mesos

Posted by Bill Farner <bi...@twitter.com>.
On Thu, Sep 19, 2013 at 11:55 AM, Bernerd Schaefer
<be...@soundcloud.com>wrote:

> Thanks for the response, Bill. Some followups below.
>
>
>> I haven't found a great way to approach either of these in mesos without
>>> assuming that your framework has full control of the cluster.  This is
>>> covered a bit in the Omega paper [1]:
>>>
>>
>> While a Mesos framework can use “filters” to describe the kinds of
>> resources that it would like to be offered, it does not have access to
>> a view of the overall cluster state – just the resources it has been
>> offered. As a result, it cannot support preemption or policies
>> requiring access to the whole cluster state: a framework simply does
>> not have any knowledge of resources that have been allocated to other
>> schedulers.
>>
>>
> I'm curious about your take as a framework author on the Omega paper's
> evaluation of Mesos. I would summarize their evaluation somewhere between --
> best-case -- "Mesos is currently non-optimal for running a service
> scheduler alongside other schedulers," and -- worst-case -- "Mesos is
> fundamentally unsuitable for service schedulers which do not own the entire
> cluster."
>

You're right, their interpretation is not very optimistic.  However, as I
understand it, the reservations feature helps the situation such that we
can do better than described in the paper.  I think documentation on
reservations would be really helpful, probably specifically in a way that
addresses the concern raised by the Omega paper.  (Apparently I'm not even
up to date on the reservations feature -- Ben tells me this works today.)


> The risk with this approach is that you wind up not playing nicely with
>> other frameworks, possibly starving them of offers.  Unfortunately this is
>> the best way i've found to glean the shape of the cluster.
>>
>
>> Aurora cheats here by 'pinning' tasks to the same machines all the time,
>> and (currently) not running anything else on those machines.  Of course,
>> this strategy falls apart when other frameworks are introduced.  I believe
>> mesos' reservations feature intends to address this.
>>
>
> Given these comments -- am I to gather that Aurora runs on its own
> dedicated Mesos cluster? Regardless, it sounds like you've had to make
> Aurora itself a monolithic scheduler, which is discouraging.
>

That is correct today, but a result of due diligence (read: paranoia) on
my part more so than technical limitations.  We run a lot of critical
stuff on Aurora, and haven't had the time to test interplay between
multiple frameworks well enough to be comfortable with the idea.  To be
clear, though, Aurora alongside other frameworks is indeed the direction
we intend to go.


> To my mind, the promise of Mesos is that I shouldn't have to build a
> scheduler that works for all different kinds of tasks. I dream of a Mesos
> where my scheduler for stateless services lives happily alongside both my
> haproxy, memcached, elasticsearch schedulers, and my hadoop, spark, storm
> schedulers. Is it just not there yet?
>
> Bernerd
> Engineer @ SoundCloud
>

Re: Service Scheduling in Mesos

Posted by Paco Nathan <ce...@gmail.com>.
From what I understand, the "Omega" paper was written in 2012. It's
great. Much has been added to Apache Mesos since, particularly w.r.t.
scheduling services. Also, the two-level categorization arguably has
evolved further.



Re: Service Scheduling in Mesos

Posted by Bernerd Schaefer <be...@soundcloud.com>.
Thanks for the response, Bill. Some followups below.


> I haven't found a great way to approach either of these in mesos without
>> assuming that your framework has full control of the cluster.  This is
>> covered a bit in the Omega paper [1]:
>>
>
> While a Mesos framework can use “filters” to describe the kinds of
> resources that it would like to be offered, it does not have access to
> a view of the overall cluster state – just the resources it has been
> offered. As a result, it cannot support preemption or policies
> requiring access to the whole cluster state: a framework simply does
> not have any knowledge of resources that have been allocated to other
> schedulers.
>
>
I'm curious about your take as a framework author on the Omega paper's
evaluation of Mesos. I would summarize their evaluation somewhere between --
best-case -- "Mesos is currently non-optimal for running a service
scheduler alongside other schedulers," and -- worst-case -- "Mesos is
fundamentally unsuitable for service schedulers which do not own the entire
cluster."

The risk with this approach is that you wind up not playing nicely with
> other frameworks, possibly starving them of offers.  Unfortunately this is
> the best way i've found to glean the shape of the cluster.
>

> Aurora cheats here by 'pinning' tasks to the same machines all the time,
> and (currently) not running anything else on those machines.  Of course,
> this strategy falls apart when other frameworks are introduced.  I believe
> mesos' reservations feature intends to address this.
>

Given these comments -- am I to gather that Aurora runs on its own
dedicated Mesos cluster? Regardless, it sounds like you've had to make
Aurora itself a monolithic scheduler, which is discouraging.

To my mind, the promise of Mesos is that I shouldn't have to build a
scheduler that works for all different kinds of tasks. I dream of a Mesos
where my scheduler for stateless services lives happily alongside both my
haproxy, memcached, elasticsearch schedulers, and my hadoop, spark, storm
schedulers. Is it just not there yet?

Bernerd
Engineer @ SoundCloud

Re: Service Scheduling in Mesos

Posted by Bill Farner <bi...@twitter.com>.
(Apologies if this breaks threading; I'm replying after subscribing with
this email address.)

Great questions!  Some responses below from my experience and perspective
formed while working on Aurora.

2. For a service scheduler built today, how much is Mesos responsible for
> and
> how much the framework? What about going forward?


It seems the most natural behavior is for Mesos to notify the framework
of events, and for the framework to apply those events to its state
(persistent or otherwise).  Notably missing are APIs to help reconcile
state mismatches.  Aurora uses framework messages and a special executor
to assist in this (i.e. compare what the scheduler thinks is on a machine
versus what's actually there).  Never mind whose fault the state mismatch
is; it can happen due to bugs or feature gaps on either side, and it's
nice when the system can auto-correct.
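In outline, that reconciliation might look something like this. This is
a hedged sketch: the set-based view and the `reconcile` name are
illustrative, not Aurora's actual mechanism:

```python
def reconcile(scheduler_view, executor_report):
    """Compare the scheduler's record of task ids on a machine with
    what the executor reports, and return the corrective actions:
    (to_kill, to_relaunch)."""
    expected = set(scheduler_view)
    actual = set(executor_report)
    to_kill = sorted(actual - expected)      # running but unknown: kill
    to_relaunch = sorted(expected - actual)  # expected but missing: relaunch
    return to_kill, to_relaunch
```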


> - If asked to schedule a task with comparatively large resource
> requirements,
>   the task may never get scheduled if it waits for a sufficiently large
>   resource offer. Instead, it should attempt to reschedule existing tasks
> to
>   "make room" for it. How might that work?


Two approaches can help here: proactive defragmentation (induce fewer,
larger resource offers), and preemption (create space on demand).  I
haven't found a great way to approach either of these in Mesos without
assuming that your framework has full control of the cluster.  This is
covered a bit in the Omega paper [1]:

While a Mesos framework can use “filters” to describe the kinds of
resources that it would like to be offered, it does not have access to a
view of the overall cluster state – just the resources it has been
offered. As a result, it cannot support preemption or policies requiring
access to the whole cluster state: a framework simply does not have any
knowledge of resources that have been allocated to other schedulers.


- If asked to schedule multiple copies of a task across different machines,
>   some copies may never get scheduled if it waits for a sufficiently
> diverse
>   set of resource offers. Instead, it should reschedule existing tasks to
>   meet the availability requirements of the task. What might that look
> like?


Aurora accepts this possibility on the assumption that stateless
services don't need all of their tasks for nominal operation (i.e.
they're usually intentionally over-provisioned).  However, our only
strategy to converge towards zero pending tasks is priority-based
preemption.
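Priority-based preemption of the kind described could be sketched like
this (the names and dict shapes are illustrative, not Aurora's code):

```python
def preemption_victims(running, pending_mem, pending_priority):
    """To place a pending task, find lower-priority running tasks on a
    single slave whose combined memory covers it, taking the lowest
    priorities first.  Return the victim list, or None if no slave can
    make room."""
    by_slave = {}
    for task in running:
        if task["priority"] < pending_priority:
            by_slave.setdefault(task["slave_id"], []).append(task)
    for tasks in by_slave.values():
        tasks.sort(key=lambda t: t["priority"])
        victims, freed = [], 0
        for task in tasks:
            victims.append(task)
            freed += task["mem"]
            if freed >= pending_mem:
                return victims
    return None
```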

- using saved offers to relaunch existing tasks, and then hoarding the freed
>   resources for scheduling new tasks.


This is done in Aurora, though currently for a different reason (finding
the best offer for a task rather than choosing the first fit).  The risk
with this approach is that you wind up not playing nicely with other
frameworks, possibly starving them of offers.  Unfortunately this is the
best way I've found to glean the shape of the cluster.
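The best-fit choice (as opposed to first fit) might look like the
following sketch. The offer shapes are made up for illustration;
minimizing leftover resources keeps the large offers free for large
tasks:

```python
def best_fit(offers, task_mem):
    """Among saved offers that can hold the task, return the one with
    the least leftover memory, or None if none fits."""
    fitting = [o for o in offers if o["mem"] >= task_mem]
    if not fitting:
        return None
    return min(fitting, key=lambda o: o["mem"] - task_mem)
```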

- If the stateful scheduler wants to run its task on a particular machine,
> but
>   that machine's resources are currently consumed by the other framework,
> what
>   happens?


Aurora cheats here by 'pinning' tasks to the same machines all the time,
and (currently) not running anything else on those machines.  Of course,
this strategy falls apart when other frameworks are introduced.  I believe
mesos' reservations feature intends to address this.


-=Bill


[1]
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf

