Posted to user@mesos.apache.org by Grégoire Seux <g....@criteo.com> on 2019/10/01 12:17:19 UTC

large task scheduling on multi-framework cluster

Hello,

I'm wondering how other Mesos users deal with scheduling large tasks (tasks that use all the resources offered by most agents).

On our cluster, we have various applications, launched mainly by Marathon. Some of those applications have large instances (30 CPUs) which use all the resources of an agent (most of our agents expose 30 CPUs to Mesos). Beyond these large applications (many instances, many resources per instance), we have many more applications whose instances are of various sizes (from 1 to 10 CPUs).

Our issue lies with scheduling: Marathon uses offers from Mesos as they come, which creates fragmentation. Most agents have small tasks running, which prevents big tasks from being scheduled. In an ideal world, Mesos (or Marathon) would make sure some apps (or frameworks, if Mesos takes that responsibility) are guaranteed large offers. We also have non-Marathon in-house frameworks with similar needs to launch large tasks.
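
A minimal sketch of this fragmentation effect, in Python. Only the 30-CPU agent and task sizes come from the description above; the ten agents and the single 3-CPU task per agent are made-up numbers for illustration. It shows that the same amount of free CPU can host very different numbers of large tasks depending on where the small tasks landed.

```python
# Illustrative numbers only: ten agents and ten 3-CPU tasks are assumptions.
AGENT_CPUS = 30
LARGE_TASK_CPUS = 30


def large_tasks_that_fit(free_cpus_per_agent, task_cpus=LARGE_TASK_CPUS):
    """Count how many tasks of size task_cpus fit, given the free CPUs on each agent."""
    return sum(free // task_cpus for free in free_cpus_per_agent)


# Spread: every agent runs one small 3-CPU task.
spread = [AGENT_CPUS - 3 for _ in range(10)]
# Packed: the same ten 3-CPU tasks all run on a single agent.
packed = [AGENT_CPUS - 30] + [AGENT_CPUS] * 9

print(sum(spread), large_tasks_that_fit(spread))  # 270 free CPUs, 0 large tasks fit
print(sum(packed), large_tasks_that_fit(packed))  # 270 free CPUs, 9 large tasks fit
```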

Our current solution is to:

  *   use a dedicated Marathon instance (and a dedicated role) for those big applications
  *   dedicate agents to this role
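
As a rough illustration of such a setup, the Python sketch below builds a Marathon app definition restricted to a dedicated role. The role name `large`, the app id, the command, and the memory size are hypothetical; `acceptedResourceRoles` is the Marathon app field that limits which roles' resources an app may use. The dedicated agents would need their resources assigned to that role (for example with the Mesos agent flag `--default_role`) and the dedicated Marathon instance registered with it (`--mesos_role`); both should be checked against the documentation of the versions in use.

```python
import json

# Hypothetical role name; only the 30-CPU figure comes from the thread.
DEDICATED_ROLE = "large"


def large_app_definition(app_id, cpus=30, mem_mb=65536, instances=1):
    """Build a Marathon app definition that only accepts resources of the dedicated role."""
    return {
        "id": app_id,
        "cmd": "sleep 3600",  # placeholder command
        "cpus": cpus,
        "mem": mem_mb,  # assumed memory size, in MiB
        "instances": instances,
        # Restrict the app to resources of this role instead of the default "*" role.
        "acceptedResourceRoles": [DEDICATED_ROLE],
    }


print(json.dumps(large_app_definition("/big/service"), indent=2))
```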

Of course, this requires extra work since our Mesos clusters are now sharded (it creates additional toil in terms of maintenance and capacity planning).
Our thinking is that the Mesos allocator might be improved to distribute offers with a better heuristic than the current one (offers are randomly sorted). Somewhat like what was suggested in http://mail-archives.apache.org/mod_mbox/mesos-user/201906.mbox/%3cCAHReGaiY0nJ0AevMvKbxAZsy2Xc=JMtszCUCDXRYZBvwKVvaUA@mail.gmail.com%3e, we could imagine sorting offers (offers from the most used agents first).
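
The idea can be sketched with a toy first-fit simulation in Python. This is not the Mesos allocator: the function names, the ten 30-CPU agents, and the task list are assumptions made for illustration. It compares trying agents in random order with trying the most used agents first, and counts how many agents stay completely free for a large task.

```python
import random

AGENT_CPUS = 30  # per-agent capacity, as in the cluster described above


def place(tasks, n_agents, order):
    """First-fit placement; `order(used)` decides which agent is tried first."""
    used = [0] * n_agents
    for cpus in tasks:
        for agent in order(used):
            if used[agent] + cpus <= AGENT_CPUS:
                used[agent] += cpus
                break
    return used


def random_order(used, rng=random.Random(0)):
    """Stand-in for randomly sorted offers: agents are tried in shuffled order."""
    agents = list(range(len(used)))
    rng.shuffle(agents)
    return agents


def most_used_first(used):
    """Hypothetical heuristic: try the most used agents first (ties keep index order)."""
    return sorted(range(len(used)), key=lambda a: used[a], reverse=True)


def empty_agents(used):
    """Number of agents that could still host a task needing a whole agent."""
    return sum(1 for u in used if u == 0)


# Twenty small tasks of 1 to 10 CPUs (110 CPUs in total) on ten agents.
small_tasks = list(range(1, 11)) * 2

spread = place(small_tasks, 10, random_order)
packed = place(small_tasks, 10, most_used_first)

print("random order:   ", empty_agents(spread), "agents left empty")
print("most used first:", empty_agents(packed), "agents left empty")  # 6 with these tasks
```

With 110 CPUs of small tasks, at least four agents are always needed, so six empty agents is the best any ordering can achieve here; the random ordering can only match it by chance.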

So I'm curious how other users handle this kind of need!

Regards,

-- 
Grégoire Seux

Re: large task scheduling on multi-framework cluster

Posted by Tim Harper <ti...@gmail.com>.
Marathon 1.9 has multi-role support, by the way.


On Oct 7, 2019, at 11:19, Grégoire Seux <g....@criteo.com> wrote:

> Hello Benjamin,
> 
>> Note that with the newest Marathon that is capable of handling multiple roles, you would not need to run a dedicated Marathon instance.
> True, it is not strictly necessary. We use separate instances as an easy way to deal with various needs:
> - quota on some roles (a multi-role Marathon could address this)
> - easy authorization configuration (otherwise we would need to configure authorization so that only specific users can use some roles)
> - general resiliency: if one Marathon has a random bug and starts deleting all its apps, at least the others are unlikely to do so at the same time!
> - performance: each Marathon handles fewer tasks and fewer health checks
> 
> But that is not really the topic of my question; I only mentioned it for context.
> 
> Is anyone encountering the same issue when scheduling large tasks?
> 
> -- 
> Grégoire
> 
> 
> 

Re: large task scheduling on multi-framework cluster

Posted by Grégoire Seux <g....@criteo.com>.
Hello Benjamin,

> Note that with the newest Marathon that is capable of handling multiple roles, you would not need to run a dedicated Marathon instance.
True, it is not strictly necessary. We use separate instances as an easy way to deal with various needs:
- quota on some roles (a multi-role Marathon could address this)
- easy authorization configuration (otherwise we would need to configure authorization so that only specific users can use some roles)
- general resiliency: if one Marathon has a random bug and starts deleting all its apps, at least the others are unlikely to do so at the same time!
- performance: each Marathon handles fewer tasks and fewer health checks

But that is not really the topic of my question; I only mentioned it for context.

Is anyone encountering the same issue when scheduling large tasks?

-- 
Grégoire




Re: large task scheduling on multi-framework cluster

Posted by Benjamin Mahler <bm...@apache.org>.
Note that with the newest Marathon that is capable of handling multiple
roles, you would not need to run a dedicated Marathon instance.
