Posted to dev@aurora.apache.org by Zameer Manji <zm...@apache.org> on 2016/11/09 23:09:15 UTC

A sketch for supporting mesos maintenance

Hey,

This is not a design doc for supporting Mesos Maintenance, but more of a
high level overview on how we *could* support it going forward. I just
wanted to get this idea out there now to see where we all stand.

As Ankit mentioned in AURORA-1800, Mesos has had Maintenance primitives
since 0.25. You can read about them here
<http://mesos.apache.org/documentation/latest/maintenance/>. The primitives
map pretty well to our existing concept of maintenance, but they allow
operators to do work across multiple frameworks.
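For context, an operator declares maintenance by posting a schedule of
windows to the Mesos master. Below is a rough sketch of building that
payload, following the JSON shape shown in the maintenance documentation
linked above. The helper name and the sample hosts are illustrative
assumptions, and the endpoint you post it to (`/maintenance/schedule` on
the master) should be checked against your Mesos version.

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def build_maintenance_schedule(machines, start_epoch_secs, duration_secs):
    """Build a single-window maintenance schedule as a plain dict.

    machines: iterable of (hostname, ip) pairs (hypothetical inputs).
    Times are converted to nanoseconds, which is what the schedule uses.
    """
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": hostname, "ip": ip} for hostname, ip in machines
                ],
                "unavailability": {
                    "start": {"nanoseconds": start_epoch_secs * NANOS_PER_SECOND},
                    "duration": {"nanoseconds": duration_secs * NANOS_PER_SECOND},
                },
            }
        ]
    }


schedule = build_maintenance_schedule(
    [("machine1", "10.0.0.1"), ("machine2", "10.0.0.2")],
    start_epoch_secs=1443830400,
    duration_secs=3600,
)
payload = json.dumps(schedule, indent=2)  # body an operator tool would POST
```

Because the schedule is owned by the master rather than by any one
framework, every framework with tasks on those machines sees the same
window.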

Since the Mesos community is growing and new frameworks are emerging all
the time, I think Aurora should support these primitives and drop our
custom primitives to be a better player in the ecosystem.

We cannot adopt these just yet, however, because they are only accessible
behind the Mesos HTTP API, which Aurora does not use today. Further,
`aurora_admin` has some SLA-aware maintenance processes which are computed
and coordinated from the client. I think for us to successfully adopt Mesos
Maintenance, we need to do at least two things:

1. Adopt the Mesos HTTP API.
2. Move the SLA-aware maintenance logic from the admin tool into the
scheduler itself, so the scheduler can coordinate with the Mesos Master in
an SLA-aware fashion.
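
To make item 2 concrete, here is a minimal sketch of the kind of check
that would move into the scheduler: given a set of hosts proposed for
maintenance, approve only those that can go down without dropping any job
below a minimum fraction of running instances. This is not Aurora's actual
SLA algorithm; the function name, the `(job_key, host)` task
representation, and the single `min_up_fraction` threshold are all
assumptions made for illustration.

```python
from collections import Counter


def hosts_safe_to_drain(candidate_hosts, tasks, min_up_fraction):
    """Greedily pick the candidate hosts that can be drained together.

    tasks: list of (job_key, host) pairs for currently running tasks.
    min_up_fraction: fraction of each job's instances that must stay up.
    """
    total = Counter(job for job, _ in tasks)
    down = Counter()  # instances already approved to go down, per job
    approved = []
    for host in candidate_hosts:
        on_host = Counter(job for job, h in tasks if h == host)
        # Draining this host is fine only if every job on it keeps enough
        # instances up, counting hosts approved earlier in this pass.
        ok = all(
            total[job] - down[job] - count >= min_up_fraction * total[job]
            for job, count in on_host.items()
        )
        if ok:
            approved.append(host)
            down.update(on_host)
    return approved


tasks = [
    ("web", "h1"), ("web", "h2"), ("web", "h3"), ("web", "h4"),
    ("db", "h1"), ("db", "h2"),
]
safe = hosts_safe_to_drain(["h1", "h2", "h3"], tasks, min_up_fraction=0.5)
```

In this example `h2` is held back because draining it together with `h1`
would take down both `db` instances. Running the check inside the
scheduler means the decision is made against live task state rather than
a snapshot fetched by the client.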

What do folks think?

-- 
Zameer Manji

Re: A sketch for supporting mesos maintenance

Posted by Zameer Manji <zm...@apache.org>.
Mesos 1.1.0 is shipping
<https://github.com/apache/mesos/blob/8822a29bce4b4c1f79ed25823c8fccbb47b1660c/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp>
an implementation of the SchedulerDriver interface that uses the HTTP API
under the hood. Adopting this implementation seems straightforward,
although we would not be able to accept Mesos maintenance requests just yet.

Once the community has proved that the HTTP API works in practice, I was
thinking about adopting the JNI implementation
<https://github.com/apache/mesos/blob/8822a29bce4b4c1f79ed25823c8fccbb47b1660c/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp>
of the HTTP API, which would allow us to accept the maintenance requests.
This might be a lot of work, because the shape of the HTTP API is very
different from the SchedulerDriver API.
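
To illustrate the difference in shape: the driver invokes one callback
per kind of occurrence, while the HTTP API delivers a single stream of
typed events that the framework has to route itself. The sketch below is
a hypothetical adapter over already-decoded event dicts. The class and
method names are made up, and the nested field layout is my reading of
the v1 scheduler event JSON, so verify it against the Mesos version in
use.

```python
class CallbackScheduler:
    """Hypothetical callback-style scheduler that just records what it saw."""

    def __init__(self):
        self.log = []

    def resource_offers(self, offers):
        self.log.append(("offers", len(offers)))

    def inverse_offers(self, inverse_offers):
        # Maintenance requests arrive here; the driver-style API has no
        # equivalent callback, which is why they need the HTTP API.
        self.log.append(("inverse_offers", len(inverse_offers)))

    def status_update(self, status):
        self.log.append(("update", status.get("state")))


def dispatch(event, scheduler):
    """Route one decoded event to a callback; return True if it was handled."""
    kind = event.get("type")
    if kind == "OFFERS":
        scheduler.resource_offers(event["offers"].get("offers", []))
    elif kind == "INVERSE_OFFERS":
        scheduler.inverse_offers(
            event["inverse_offers"].get("inverse_offers", [])
        )
    elif kind == "UPDATE":
        scheduler.status_update(event["update"]["status"])
    else:
        return False  # event types this sketch does not cover
    return True


scheduler = CallbackScheduler()
events = [
    {"type": "OFFERS", "offers": {"offers": [{"id": {"value": "o1"}}]}},
    {"type": "INVERSE_OFFERS",
     "inverse_offers": {"inverse_offers": [{"id": {"value": "i1"}}]}},
    {"type": "UPDATE", "update": {"status": {"state": "TASK_RUNNING"}}},
    {"type": "HEARTBEAT"},
]
handled = [dispatch(event, scheduler) for event in events]
```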

Maintenance state is surfaced in the offer in the `Unavailability` field
<https://github.com/apache/mesos/blob/8822a29bce4b4c1f79ed25823c8fccbb47b1660c/include/mesos/mesos.proto#L1278>.
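
As a sketch of how a scheduler could use that field, the snippet below
checks whether a task's expected run time would overlap the maintenance
window on an offer. It assumes the offer has been decoded into a dict
with an optional `unavailability` entry holding `start` and an optional
`duration`, both in nanoseconds, and it treats a missing duration as an
open-ended window. The function names and the idea of a known task
duration are assumptions for illustration.

```python
NANOS_PER_SECOND = 1_000_000_000


def maintenance_window(offer):
    """Return (start_ns, end_ns) for an offer dict, or None if none is set.

    end_ns is None when no duration is given (treated here as open-ended).
    """
    unavailability = offer.get("unavailability")
    if unavailability is None:
        return None
    start_ns = unavailability["start"]["nanoseconds"]
    duration = unavailability.get("duration")
    end_ns = None if duration is None else start_ns + duration["nanoseconds"]
    return start_ns, end_ns


def overlaps_maintenance(offer, task_start_ns, task_duration_ns):
    """True if the task's run interval intersects the maintenance window."""
    window = maintenance_window(offer)
    if window is None:
        return False
    start_ns, end_ns = window
    task_end_ns = task_start_ns + task_duration_ns
    if task_end_ns <= start_ns:
        return False  # task finishes before maintenance begins
    return end_ns is None or task_start_ns < end_ns


offer = {
    "hostname": "machine1",
    "unavailability": {
        "start": {"nanoseconds": 1000 * NANOS_PER_SECOND},
        "duration": {"nanoseconds": 100 * NANOS_PER_SECOND},
    },
}
short_task = overlaps_maintenance(offer, 0, 500 * NANOS_PER_SECOND)
long_task = overlaps_maintenance(offer, 900 * NANOS_PER_SECOND,
                                 150 * NANOS_PER_SECOND)
```

A scheduler could use a check like this to prefer offers from hosts with
no upcoming window for long-running tasks.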


On Wed, Nov 9, 2016 at 7:13 PM, Bill Farner <wf...@apache.org> wrote:

> (1) sounds like an inevitability, do you have a sense of what stands in the
> way, or what it will take?
>
> (2) is a win for ending behavior redundancy. This is probably in the doc,
> but I'm lazy - are maintenance statuses surfaced in offers? IIRC the
> original incarnation of maintenance modes in mesos didn't surface that
> info, which eliminated important state for scheduling.

Re: A sketch for supporting mesos maintenance

Posted by Bill Farner <wf...@apache.org>.
(1) sounds like an inevitability, do you have a sense of what stands in the
way, or what it will take?

(2) is a win for ending behavior redundancy. This is probably in the doc,
but I'm lazy - are maintenance statuses surfaced in offers? IIRC the
original incarnation of maintenance modes in mesos didn't surface that
info, which eliminated important state for scheduling.