Posted to dev@mesos.apache.org by Clément Michaud <cl...@gmail.com> on 2018/10/11 08:29:40 UTC

Adding support for implicit allocation of mandatory custom resources in Mesos

Hello,

TL;DR: we have added network bandwidth as a first-class resource in our
clusters with a custom isolator, and we have patched the Mesos master to
introduce the concept of implicit allocation of custom resources so that
network bandwidth is mandatory for all tasks. I'd like to know what you think
about what we have implemented, and whether we could introduce a new hook
in the Mesos master for injecting mandatory custom resources into tasks.


At Criteo we have implemented a custom solution in our Mesos clusters to
prevent network noisy neighbors and to allow our users to define a custom
amount of reserved network bandwidth per application. Please note that we
run our clusters on a flat network and do not use any kind of network
overlay.

In order to address these use cases, we enabled the `net_cls` isolator and
wrote a custom isolator using tc, conntrack and iptables, with each container
getting a dedicated reserved amount of network bandwidth declared in its
Marathon or Aurora configuration.
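
For illustration only, here is a minimal sketch of the kind of shaping such
an isolator performs. This is not our actual isolator code; the helper name
and the qdisc layout are assumptions:

```
// Sketch only: reserve `mbps` of egress bandwidth for one container by
// creating an HTB class whose minor id matches the container's net_cls
// classid. Assumes a root HTB qdisc (`tc qdisc add dev <iface> root
// handle 1: htb`) and a cgroup filter (`tc filter add dev <iface>
// parent 1: handle 1: cgroup`) were installed at agent startup, so
// packets tagged by net_cls land in class 1:<minor>.
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper; our real isolator also sets up conntrack/iptables.
int reserveBandwidth(const std::string& iface,
                     unsigned int netClsMinor,
                     unsigned int mbps)
{
  char cmd[256];
  std::snprintf(cmd, sizeof(cmd),
      "tc class add dev %s parent 1: classid 1:%x htb rate %umbit",
      iface.c_str(), netClsMinor, mbps);
  return std::system(cmd) == 0 ? 0 : -1;
}
```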

In the first implementation of our solution, the resources were not declared
on the agents, and thus obviously not taken into account by Mesos, but the
isolator allocated to each task an amount of network bandwidth proportional
to the ratio of its reserved CPUs to the CPUs available on the server.
Basically, the per-container network bandwidth limit was enforced but Mesos
was not aware of it. Using CPU as a proxy for the amount of network
bandwidth protected us from situations where an agent could allocate more
network bandwidth than it actually had. However, this model reached its
limits when we introduced big consumers of network bandwidth into our
clusters: they had to request more CPUs just to get more network bandwidth,
which introduced scheduling issues.

Hence, we decided to leverage Mesos custom resources to let our users
declare their requirements, and also to decouple network bandwidth from CPU
to avoid those scheduling issues. We first declared the network bandwidth
resource on every Mesos agent, even though tasks were not declaring any.
Then we faced a first issue: the lack of support for network bandwidth
and/or custom resources in Marathon and Aurora (and, it seems, in most
frameworks). This led to a second issue: we needed Mesos to account for the
network bandwidth of all tasks even while some frameworks did not support
it yet. Solving the second problem allowed us to run a smooth migration,
patching the frameworks independently in a second phase.
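
For reference, declaring such a custom scalar resource on an agent only
requires listing it in the agent's `--resources` flag; the amounts below
are illustrative:

```
mesos-agent --resources="cpus:32;mem:65536;network_bandwidth:10000" ...
```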

Along the way we found out that the “custom resources” system wasn’t meeting
our needs, because it only allows for “optional resources” and not
“mandatory resources”: resources that should be accounted for on every task
in a cluster, even when not requested explicitly, as CPU, RAM and disk space
are. Network bandwidth and disk I/O, for instance, are good candidates.

To enforce the accounting of network bandwidth across all tasks, we wanted
to allocate an implicit amount of network bandwidth to tasks not declaring
any in their configuration. One possible implementation was to make the
Mesos master automatically compute the allocated network bandwidth for the
task when the offer is accepted and subtract this amount from the overall
available resources in Mesos. We consider this implicit allocation a
fallback mechanism for frameworks not yet supporting "mandatory" resources.
Indeed, in an ideal environment all frameworks would support these
mandatory resources. Unfortunately, adding support for a new resource (or
for custom resources in general) in all frameworks might not be achievable
in a timely manner, especially in an ecosystem with multiple frameworks.

Consequently, we wrote a patch for the Mesos master that allocates an
implicit amount of network bandwidth when none is provided in the TaskInfo.
In our case this implicit amount is computed with the following
Criteo-specific rule: `task_used_cpu / slave_total_cpus * slave_total_bandwidth`.
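
As a quick sanity check of this rule, here is a minimal, self-contained
sketch with illustrative names and numbers:

```
#include <iostream>

// The Criteo-specific fallback rule quoted above.
double implicitBandwidth(double taskCpus,
                         double slaveTotalCpus,
                         double slaveTotalBandwidth)
{
  return taskCpus / slaveTotalCpus * slaveTotalBandwidth;
}

int main()
{
  // A 2-CPU task on a 32-CPU agent declaring 10000 Mbps of bandwidth
  // is implicitly charged 2 / 32 * 10000 = 625 Mbps.
  std::cout << implicitBandwidth(2, 32, 10000) << std::endl;  // 625
}
```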

Here is what happened while our frameworks did not yet support network
bandwidth: offers were sent to frameworks and they accepted or rejected
them regardless of the network bandwidth available on the slave. When an
offer was accepted, the TaskInfo sent by the framework obviously did not
contain any network bandwidth, but the Mesos master implicitly injected
some and let the task proceed. There were then two cases: either the slave
had enough resources to run the task and it was scheduled as expected, or
it did not have enough resources, the task failed to be deployed, and Mesos
sent back a TASK_ERROR to the framework. It was then the responsibility of
the scheduler to retry with subsequent offers. This solution created a bit
of extra work for the master, but we tested it and ran it in production for
a few weeks in several clusters of around 250 servers each and it seemed to
work well, at least with Marathon and Aurora. At this point the migration
was expected to be smooth because it only required a restart of all the
tasks for network bandwidth to be accounted for cluster-wide. It ended up
being as smooth as expected.
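
In rough pseudo-C++, that fallback flow looks like this (the types are
simplified stand-ins for the real Mesos ones, and the function name is
ours):

```
#include <map>
#include <string>

using Resources = std::map<std::string, double>;  // simplified stand-in

struct TaskInfo {  // simplified stand-in for mesos::TaskInfo
  Resources resources;
};

// Applied by the master when a framework launches a task. Returns false
// when the agent no longer has enough bandwidth left, in which case the
// task is rejected and the framework receives a TASK_ERROR.
bool injectAndValidate(const Resources& slaveTotal,
                       const Resources& slaveAvailable,
                       TaskInfo& task)
{
  if (task.resources.count("network_bandwidth") == 0) {
    // Fallback rule from above: charge bandwidth proportionally to the
    // task's share of the agent's CPUs.
    task.resources["network_bandwidth"] =
        task.resources.at("cpus") / slaveTotal.at("cpus")
        * slaveTotal.at("network_bandwidth");
  }
  return task.resources.at("network_bandwidth")
      <= slaveAvailable.at("network_bandwidth");
}
```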

In the meantime, we patched Marathon and Aurora to add full support for
network bandwidth and avoid the potential TASK_ERROR messages, keeping in
mind that we'll soon host other frameworks that will probably not support
network bandwidth from the start. So we'll likely keep our patch for the
foreseeable future, and we think it might be a good idea to introduce a
hook in the Mesos master for adding implicit resources to tasks.

What we propose is to introduce a method called
masterLaunchTaskResourceDecorator in the hook interface and call it at the
right location, letting users add whatever implicit resources they want.

This would give the following signature:

```
Result<Resources> masterLaunchTaskResourceDecorator(
    const Resources& slaveResources,
    TaskInfo& task)
```
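
To make the intent concrete, a module implementing the proposed hook could
look roughly like the following sketch. The hook does not exist upstream
yet; the class name and the policy helper are hypothetical, and the usual
module registration boilerplate is omitted:

```
#include <mesos/hook.hpp>
#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

#include <stout/error.hpp>
#include <stout/result.hpp>
#include <stout/stringify.hpp>
#include <stout/try.hpp>

// Hypothetical policy helper, e.g. the CPU-proportional rule shown earlier.
double computeImplicitBandwidth(
    const mesos::Resources& slaveTotal,
    const mesos::Resources& taskResources);

class BandwidthDecoratorHook : public mesos::Hook  // hypothetical module
{
public:
  Result<mesos::Resources> masterLaunchTaskResourceDecorator(
      const mesos::Resources& slaveResources,
      mesos::TaskInfo& task)
  {
    mesos::Resources taskResources(task.resources());

    // Tasks that already declare the resource are left untouched.
    if (taskResources.names().count("network_bandwidth") > 0) {
      return None();
    }

    double implicit = computeImplicitBandwidth(slaveResources, taskResources);

    Try<mesos::Resource> bandwidth = mesos::Resources::parse(
        "network_bandwidth", stringify(implicit), "*");

    if (bandwidth.isError()) {
      return Error(bandwidth.error());
    }

    return taskResources + bandwidth.get();
  }
};
```

In this sketch, returning `None()` means "leave the task unchanged",
mirroring how the existing decorator hooks behave.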

Can you please tell us whether such an integration point would be
acceptable for merging upstream?


You can have a look at our current implementation here:
https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth
 (just as a reference; it is not the basis of a patch for upstream Mesos).

Thank you,

Clément.

Re: Adding support for implicit allocation of mandatory custom resources in Mesos

Posted by Clément Michaud <cl...@gmail.com>.
Hello Benjamin,

Sure thing! I will file a ticket and do the patch.

Thank you,
Clément.

On Thu, Oct 11, 2018 at 9:39 PM Benjamin Mahler <bm...@apache.org> wrote:

> Thanks for the thorough explanation.
>
> Yes, it sounds acceptable and useful for assigning disk i/o and network
> i/o. The error case of there not being enough resources post-injection
> seems unfortunate but I don't see a way around it.
>
> Can you file a ticket with this background?

Re: Adding support for implicit allocation of mandatory custom resources in Mesos

Posted by Benjamin Mahler <bm...@apache.org>.
Thanks for the thorough explanation.

Yes, it sounds acceptable and useful for assigning disk i/o and network
i/o. The error case of there not being enough resources post-injection
seems unfortunate but I don't see a way around it.

Can you file a ticket with this background?
