Posted to dev@mesos.apache.org by Benjamin Mahler <be...@gmail.com> on 2014/08/25 21:24:19 UTC

Design Review: Maintenance Primitives

Hi all,

I wanted to take a moment to thank Alexandra Sava, who completed her OPW
internship this past week. We worked together in the second half of her
internship to create a design document for maintenance primitives in Mesos
(the original ticket is MESOS-1474
<https://issues.apache.org/jira/browse/MESOS-1474>, but the design document
is the most up-to-date plan).

Maintenance in this context consists of anything that requires the tasks
running on the slave to be killed (e.g. kernel upgrades, machine
decommissioning, non-recoverable mesos upgrades / configuration changes,
etc).

The desire is to expose maintenance events to frameworks in a generic
manner, so as to allow frameworks to respect their SLAs, perform better
task placement, and migrate tasks if necessary.

The design document is here:
https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing

Please take a moment before the end of next week to go over this
design. *Higher
level feedback and questions can be discussed most effectively in this
thread.*

Let's thank Alexandra for her work!

Ben

Re: Design Review: Maintenance Primitives

Posted by Benjamin Mahler <be...@gmail.com>.
Re: terminology

An offer can be rescinded; resources can be revoked. Only inverse offers
with a hard deadline can lead to the revocation of resources. In this
sense, an inverse offer is more akin to a request to release allocated
resources; a hard inverse offer is also a revocation forewarning.

The idea here was to think of Offer and InverseOffer (both nouns) as two
opposite flavors of offer objects: one offering resources to the
framework, the other offering resources back to Mesos. Of course, the
InverseOffer is not a perfect inverse as far as its mechanics are
concerned. :)

If we think of InverseOffers as an enhancement to the offer ecosystem,
we can leverage the existing rescind mechanism to retract inverse
offers. That is, rescindOffer can retract either flavor of offer.
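
To make this concrete, here is a rough scheduler-side sketch of how the
two flavors could coexist with a single rescind path. These are
simplified, hypothetical C++ stand-ins for the proposed types and
callbacks, not the actual Mesos driver API:

    // Illustrative sketch only: simplified stand-ins for the proposed
    // Offer/InverseOffer/OfferID types, not the real mesos.proto messages
    // or the SchedulerDriver interface.
    #include <string>
    #include <unordered_map>

    struct OfferID { std::string value; };

    struct Offer        { OfferID id; };  // resources offered to the framework
    struct InverseOffer { OfferID id; };  // resources offered back to Mesos

    class ExampleScheduler {
    public:
      // Regular offer: track it until it is used or rescinded.
      void resourceOffers(const Offer& offer) {
        offers_[offer.id.value] = offer;
      }

      // Inverse offer: tracked the same way; the framework decides when
      // (and whether) to comply before any deadline.
      void inverseOffers(const InverseOffer& inverse) {
        inverseOffers_[inverse.id.value] = inverse;
      }

      // A single rescind path retracts either flavor, keyed by OfferID.
      void offerRescinded(const OfferID& id) {
        offers_.erase(id.value);
        inverseOffers_.erase(id.value);
      }

    private:
      std::unordered_map<std::string, Offer> offers_;
      std::unordered_map<std::string, InverseOffer> inverseOffers_;
    };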


Re: including hostname in InverseOffer

Not intentional; currently the 'Offer' message has more slave
information (SlaveID, hostname, attributes). I can add these to
'InverseOffer' if you think they would be valuable to you.
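
For illustration, something along these lines; the field names and
layout are assumptions of mine, not the actual proto definition:

    // Hypothetical shape of an InverseOffer carrying the same slave
    // information as Offer (SlaveID, hostname, attributes); simplified
    // strings stand in for the real messages.
    #include <string>
    #include <vector>

    struct InverseOfferSketch {
      std::string offer_id;                 // identifies this inverse offer
      std::string framework_id;             // framework asked to release resources
      std::string slave_id;                 // slave the resources live on
      std::string hostname;                 // added per the request above
      std::vector<std::string> attributes;  // stand-in for Attribute messages
      std::vector<std::string> resources;   // stand-in for Resource messages
    };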


Re: automation

Yes, definitely. I will include a section on future work in the
maintenance space to reflect your point, as well as some of the
shorter-term enhancements that can provide more automation.


On Wed, Aug 27, 2014 at 6:36 PM, Sharma Podila <sp...@netflix.com> wrote:

> Nicely written doc. Here's a few thoughts:
>
> - There's some commonality between the existing offerRescinded() and the
> new inverseOffer(). Maybe consider having same method names for them with
> differing signatures? I'd second Maxime's point about possibly renaming
> inverseOffer to something else - maybe offerRescind() or offerRevoke()?
>
> - Offer has hostname but InverseOffer doesn't, is that intentional?
>
> - I like it that the operations of maintenance Vs. draining are separated
> out. Draining deactivates the slave, and should also immediately rescind
> all outstanding offers (I suppose using the offers would result in
> TASK_LOST, but it would play nice with frameworks if offers are rescinded
> proactively).
>
> - For maintenance across all or a large part of the cluster, these
> maintenance primitives would be helpful. Another piece required to
> achieving fully automated maintenance (say, upgrade kernel patch on all
> slaves) would be to have a maintenance orchestration engine that has
> constraints such as "ensure not more than X% of slave type A are down for
> maintenance concurrently". That is, automated rolling upgrades with SLA on
> uptime/availability. Such an engine could accomplish its task using these
> primitives.
>
>
>
>
>
>
> On Tue, Aug 26, 2014 at 2:23 PM, Maxime Brugidou <
> maxime.brugidou@gmail.com> wrote:
>
>> Glad to see that you are really thinking this through.
>>
>> Yes it's explicit that resources won't be revoked and will stay
>> outstanding in this case but I would just add that the slave won't enter
>> the "drained" state. It's just hard to follow the
>> drain/revoke/outstanding/inverse offer/reclaim vocabulary. Actually, did
>> you also think about the name? Inverse offer sounds weird to me. Maybe
>> resourceOffers()  and resource Revoke()? You probably have better arguments
>> and idea than me though :)
>>
>> Another small note: the OfferID in the inverse offer is completely new
>> and just used to identify the inverse offer right? I got a bit confused
>> about a link between a previous offerID and this but then I saw the
>> Resource field. Wouldn't it be clearer to have InverseOfferID?
>>
>> Thanks for the work! I really want to have these primitives.
>> On Aug 26, 2014 10:59 PM, "Benjamin Mahler" <be...@gmail.com>
>> wrote:
>>
>>> You're right, we don't account for that in the current design because
>>> such a framework would be relying on disk resources outside of the sandbox.
>>> Currently, we don't have a model for these "persistent" resources (e.g.
>>> disk volume used for HDFS DataNode data). Unlike the existing resources,
>>> persistent resources will not be tied to the lifecycle of the executor/task.
>>>
>>> When we have a model for persistent resources, I can see this fitting
>>> into the primitives we are proposing here. Since inverse offers work at the
>>> resource level, we can provide control to the operators to determine
>>> whether the persistent resources should be reclaimed from the framework as
>>> part of the maintenance:
>>>
>>> E.g. If decommissioning a machine, the operator can ensure that all
>>> persistent resources are reclaimed. If rebooting a machine, the operator
>>> can leave these resources allocated to the framework for when the machine
>>> is back in the cluster.
>>>
>>> Now, since we have the soft deadlines on inverse offers, a framework
>>> like HDFS can determine when it can comply to inverse offers based on the
>>> global data replication state (e.g. always ensure that 2/3 replicas of a
>>> block are available). If relinquishing a particular data volume would mean
>>> that only 1 copy of a block is available, the framework can wait to comply
>>> with the inverse offer, or can take steps to create more replicas.
>>>
>>> One interesting question is how the resource expiry time will interact
>>> with persistent resources, we may want to expose the expiry time at the
>>> resource level rather than the offer level. Will think about this.
>>>
>>> *However could you specify that when you drain a slave with hard:false
>>>> you don't enter the drained state even when the deadline has passed if
>>>> tasks are still running? This is not explicit in the document and we want
>>>> to make sure operators have the information about this and could avoid
>>>> unfortunate rolling restarts.*
>>>
>>>
>>> This is explicit in the document under the soft deadline section: the
>>> inverse offer will remain outstanding after the soft deadline elapses, we
>>> won't forcibly drain the task. Anything that's not clear here?
>>>
>>>
>>>
>>>
>>> On Mon, Aug 25, 2014 at 1:08 PM, Maxime Brugidou <
>>> maxime.brugidou@gmail.com> wrote:
>>>
>>>> Nice work!
>>>>
>>>> First question: don't you think that operations should differentiate
>>>> short and long maintenance?
>>>> I am thinking about frameworks that use persistent storage on disk for
>>>> example. A short maintenance such as a slave reboot or upgrade could be
>>>> done without moving the data to another slave. However decommissioning
>>>> requires to drain the storage too.
>>>>
>>>> If you have an HDFS datanode with 50TB of (replicated) data, you might
>>>> not want to drain it for a reboot (assuming your replication factor is high
>>>> enough) since it takes ages. However for decommission it might make sense
>>>> to drain it.
>>>>
>>>> Not sure if this is a good example but I feel the need to know if the
>>>> maintenance is planned to be short or is forever. I know this does not fit
>>>> the nice modeling you describe :-/
>>>>
>>>> Actually for HDFS we could define a threshold where "good enough"
>>>> replication without the slave would be considered enough and thus we could
>>>> deactivate the slave. This would prevent a rolling restart to go too fast.
>>>> However could you specify that when you drain a slave with hard:false you
>>>> don't enter the drained state even when the deadline has passed if tasks
>>>> are still running? This is not explicit in the document and we want to make
>>>> sure operators have the information about this and could avoid unfortunate
>>>> rolling restarts.
>>>>  On Aug 25, 2014 9:25 PM, "Benjamin Mahler" <be...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I wanted to take a moment to thank Alexandra Sava, who completed her
>>>>> OPW internship this past week. We worked together in the second half of her
>>>>> internship to create a design document for maintenance primitives in Mesos
>>>>> (the original ticket is MESOS-1474
>>>>> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
>>>>> document is the most up-to-date plan).
>>>>>
>>>>> Maintenance in this context consists of anything that requires the
>>>>> tasks running on the slave to be killed (e.g. kernel upgrades, machine
>>>>> decommissioning, non-recoverable mesos upgrades / configuration changes,
>>>>> etc).
>>>>>
>>>>> The desire is to expose maintenance events to frameworks in a generic
>>>>> manner, as to allow frameworks to respect their SLAs, perform better task
>>>>> placement, and migrate tasks if necessary.
>>>>>
>>>>> The design document is here:
>>>>>
>>>>> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>>>>>
>>>>> Please take a moment before the end of next week to go over this
>>>>> design. *Higher level feedback and questions can be discussed most
>>>>> effectively in this thread.*
>>>>>
>>>>> Let's thank Alexandra for her work!
>>>>>
>>>>> Ben
>>>>>
>>>>
>>>
>

Re: Design Review: Maintenance Primitives

Posted by Sharma Podila <sp...@netflix.com>.
Nicely written doc. Here are a few thoughts:

- There's some commonality between the existing offerRescinded() and the
new inverseOffer(). Maybe consider having the same method names for them
with differing signatures? I'd second Maxime's point about possibly
renaming inverseOffer to something else - maybe offerRescind() or
offerRevoke()?

- Offer has hostname but InverseOffer doesn't, is that intentional?

- I like that the operations of maintenance vs. draining are separated
out. Draining deactivates the slave, and should also immediately rescind
all outstanding offers (I suppose using the offers would result in
TASK_LOST, but it would play nicer with frameworks if offers are
rescinded proactively).

- For maintenance across all or a large part of the cluster, these
maintenance primitives would be helpful. Another piece required to
achieve fully automated maintenance (say, upgrading the kernel on all
slaves) would be a maintenance orchestration engine that enforces
constraints such as "ensure not more than X% of slave type A are down
for maintenance concurrently". That is, automated rolling upgrades with
an SLA on uptime/availability. Such an engine could accomplish its task
using these primitives (see the sketch below).
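
Such an engine would sit outside of Mesos, but the constraint itself is
simple to state. Here is a minimal sketch of the check it might run
before draining one more slave; all names and numbers are made up for
illustration and are not part of Mesos:

    // Sketch of the "no more than X% of slave type A down concurrently"
    // check an external maintenance orchestrator might run before
    // draining one more slave. Illustrative only.
    #include <iostream>
    #include <string>
    #include <unordered_map>

    struct MaintenancePolicy {
      double maxDownFraction;  // e.g. 0.10 == at most 10% of a type down
    };

    bool canStartMaintenance(
        const std::string& slaveType,
        const std::unordered_map<std::string, int>& totalByType,
        const std::unordered_map<std::string, int>& downByType,
        const MaintenancePolicy& policy) {
      const int total = totalByType.count(slaveType) ? totalByType.at(slaveType) : 0;
      const int down = downByType.count(slaveType) ? downByType.at(slaveType) : 0;
      if (total == 0) return false;
      // Taking one more slave down must stay within the budget.
      return (down + 1) <= policy.maxDownFraction * total;
    }

    int main() {
      std::unordered_map<std::string, int> total = {{"typeA", 100}};
      std::unordered_map<std::string, int> down = {{"typeA", 9}};
      MaintenancePolicy policy{0.10};
      std::cout << std::boolalpha
                << canStartMaintenance("typeA", total, down, policy)  // true
                << std::endl;
      return 0;
    }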






On Tue, Aug 26, 2014 at 2:23 PM, Maxime Brugidou <ma...@gmail.com>
wrote:

> Glad to see that you are really thinking this through.
>
> Yes it's explicit that resources won't be revoked and will stay
> outstanding in this case but I would just add that the slave won't enter
> the "drained" state. It's just hard to follow the
> drain/revoke/outstanding/inverse offer/reclaim vocabulary. Actually, did
> you also think about the name? Inverse offer sounds weird to me. Maybe
> resourceOffers()  and resource Revoke()? You probably have better arguments
> and idea than me though :)
>
> Another small note: the OfferID in the inverse offer is completely new and
> just used to identify the inverse offer right? I got a bit confused about a
> link between a previous offerID and this but then I saw the Resource field.
> Wouldn't it be clearer to have InverseOfferID?
>
> Thanks for the work! I really want to have these primitives.
> On Aug 26, 2014 10:59 PM, "Benjamin Mahler" <be...@gmail.com>
> wrote:
>
>> You're right, we don't account for that in the current design because
>> such a framework would be relying on disk resources outside of the sandbox.
>> Currently, we don't have a model for these "persistent" resources (e.g.
>> disk volume used for HDFS DataNode data). Unlike the existing resources,
>> persistent resources will not be tied to the lifecycle of the executor/task.
>>
>> When we have a model for persistent resources, I can see this fitting
>> into the primitives we are proposing here. Since inverse offers work at the
>> resource level, we can provide control to the operators to determine
>> whether the persistent resources should be reclaimed from the framework as
>> part of the maintenance:
>>
>> E.g. If decommissioning a machine, the operator can ensure that all
>> persistent resources are reclaimed. If rebooting a machine, the operator
>> can leave these resources allocated to the framework for when the machine
>> is back in the cluster.
>>
>> Now, since we have the soft deadlines on inverse offers, a framework like
>> HDFS can determine when it can comply to inverse offers based on the global
>> data replication state (e.g. always ensure that 2/3 replicas of a block are
>> available). If relinquishing a particular data volume would mean that only
>> 1 copy of a block is available, the framework can wait to comply with the
>> inverse offer, or can take steps to create more replicas.
>>
>> One interesting question is how the resource expiry time will interact
>> with persistent resources, we may want to expose the expiry time at the
>> resource level rather than the offer level. Will think about this.
>>
>> *However could you specify that when you drain a slave with hard:false
>>> you don't enter the drained state even when the deadline has passed if
>>> tasks are still running? This is not explicit in the document and we want
>>> to make sure operators have the information about this and could avoid
>>> unfortunate rolling restarts.*
>>
>>
>> This is explicit in the document under the soft deadline section: the
>> inverse offer will remain outstanding after the soft deadline elapses, we
>> won't forcibly drain the task. Anything that's not clear here?
>>
>>
>>
>>
>> On Mon, Aug 25, 2014 at 1:08 PM, Maxime Brugidou <
>> maxime.brugidou@gmail.com> wrote:
>>
>>> Nice work!
>>>
>>> First question: don't you think that operations should differentiate
>>> short and long maintenance?
>>> I am thinking about frameworks that use persistent storage on disk for
>>> example. A short maintenance such as a slave reboot or upgrade could be
>>> done without moving the data to another slave. However decommissioning
>>> requires to drain the storage too.
>>>
>>> If you have an HDFS datanode with 50TB of (replicated) data, you might
>>> not want to drain it for a reboot (assuming your replication factor is high
>>> enough) since it takes ages. However for decommission it might make sense
>>> to drain it.
>>>
>>> Not sure if this is a good example but I feel the need to know if the
>>> maintenance is planned to be short or is forever. I know this does not fit
>>> the nice modeling you describe :-/
>>>
>>> Actually for HDFS we could define a threshold where "good enough"
>>> replication without the slave would be considered enough and thus we could
>>> deactivate the slave. This would prevent a rolling restart to go too fast.
>>> However could you specify that when you drain a slave with hard:false you
>>> don't enter the drained state even when the deadline has passed if tasks
>>> are still running? This is not explicit in the document and we want to make
>>> sure operators have the information about this and could avoid unfortunate
>>> rolling restarts.
>>>  On Aug 25, 2014 9:25 PM, "Benjamin Mahler" <be...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I wanted to take a moment to thank Alexandra Sava, who completed her
>>>> OPW internship this past week. We worked together in the second half of her
>>>> internship to create a design document for maintenance primitives in Mesos
>>>> (the original ticket is MESOS-1474
>>>> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
>>>> document is the most up-to-date plan).
>>>>
>>>> Maintenance in this context consists of anything that requires the
>>>> tasks running on the slave to be killed (e.g. kernel upgrades, machine
>>>> decommissioning, non-recoverable mesos upgrades / configuration changes,
>>>> etc).
>>>>
>>>> The desire is to expose maintenance events to frameworks in a generic
>>>> manner, as to allow frameworks to respect their SLAs, perform better task
>>>> placement, and migrate tasks if necessary.
>>>>
>>>> The design document is here:
>>>>
>>>> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>>>>
>>>> Please take a moment before the end of next week to go over this
>>>> design. *Higher level feedback and questions can be discussed most
>>>> effectively in this thread.*
>>>>
>>>> Let's thank Alexandra for her work!
>>>>
>>>> Ben
>>>>
>>>
>>

Re: Design Review: Maintenance Primitives

Posted by Maxime Brugidou <ma...@gmail.com>.
Glad to see that you are really thinking this through.

Yes, it's explicit that resources won't be revoked and will stay
outstanding in this case, but I would just add that the slave won't
enter the "drained" state. It's just hard to follow the
drain/revoke/outstanding/inverse offer/reclaim vocabulary. Actually, did
you also think about the name? Inverse offer sounds weird to me. Maybe
resourceOffers() and resourceRevoke()? You probably have better
arguments and ideas than me, though :)

Another small note: the OfferID in the inverse offer is completely new
and just used to identify the inverse offer, right? I got a bit confused
about a link between a previous OfferID and this, but then I saw the
Resource field. Wouldn't it be clearer to have an InverseOfferID?

Thanks for the work! I really want to have these primitives.
On Aug 26, 2014 10:59 PM, "Benjamin Mahler" <be...@gmail.com>
wrote:

> You're right, we don't account for that in the current design because such
> a framework would be relying on disk resources outside of the sandbox.
> Currently, we don't have a model for these "persistent" resources (e.g.
> disk volume used for HDFS DataNode data). Unlike the existing resources,
> persistent resources will not be tied to the lifecycle of the executor/task.
>
> When we have a model for persistent resources, I can see this fitting into
> the primitives we are proposing here. Since inverse offers work at the
> resource level, we can provide control to the operators to determine
> whether the persistent resources should be reclaimed from the framework as
> part of the maintenance:
>
> E.g. If decommissioning a machine, the operator can ensure that all
> persistent resources are reclaimed. If rebooting a machine, the operator
> can leave these resources allocated to the framework for when the machine
> is back in the cluster.
>
> Now, since we have the soft deadlines on inverse offers, a framework like
> HDFS can determine when it can comply to inverse offers based on the global
> data replication state (e.g. always ensure that 2/3 replicas of a block are
> available). If relinquishing a particular data volume would mean that only
> 1 copy of a block is available, the framework can wait to comply with the
> inverse offer, or can take steps to create more replicas.
>
> One interesting question is how the resource expiry time will interact
> with persistent resources, we may want to expose the expiry time at the
> resource level rather than the offer level. Will think about this.
>
> *However could you specify that when you drain a slave with hard:false you
>> don't enter the drained state even when the deadline has passed if tasks
>> are still running? This is not explicit in the document and we want to make
>> sure operators have the information about this and could avoid unfortunate
>> rolling restarts.*
>
>
> This is explicit in the document under the soft deadline section: the
> inverse offer will remain outstanding after the soft deadline elapses, we
> won't forcibly drain the task. Anything that's not clear here?
>
>
>
>
> On Mon, Aug 25, 2014 at 1:08 PM, Maxime Brugidou <
> maxime.brugidou@gmail.com> wrote:
>
>> Nice work!
>>
>> First question: don't you think that operations should differentiate
>> short and long maintenance?
>> I am thinking about frameworks that use persistent storage on disk for
>> example. A short maintenance such as a slave reboot or upgrade could be
>> done without moving the data to another slave. However decommissioning
>> requires to drain the storage too.
>>
>> If you have an HDFS datanode with 50TB of (replicated) data, you might
>> not want to drain it for a reboot (assuming your replication factor is high
>> enough) since it takes ages. However for decommission it might make sense
>> to drain it.
>>
>> Not sure if this is a good example but I feel the need to know if the
>> maintenance is planned to be short or is forever. I know this does not fit
>> the nice modeling you describe :-/
>>
>> Actually for HDFS we could define a threshold where "good enough"
>> replication without the slave would be considered enough and thus we could
>> deactivate the slave. This would prevent a rolling restart to go too fast.
>> However could you specify that when you drain a slave with hard:false you
>> don't enter the drained state even when the deadline has passed if tasks
>> are still running? This is not explicit in the document and we want to make
>> sure operators have the information about this and could avoid unfortunate
>> rolling restarts.
>>  On Aug 25, 2014 9:25 PM, "Benjamin Mahler" <be...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I wanted to take a moment to thank Alexandra Sava, who completed her OPW
>>> internship this past week. We worked together in the second half of her
>>> internship to create a design document for maintenance primitives in Mesos
>>> (the original ticket is MESOS-1474
>>> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
>>> document is the most up-to-date plan).
>>>
>>> Maintenance in this context consists of anything that requires the tasks
>>> running on the slave to be killed (e.g. kernel upgrades, machine
>>> decommissioning, non-recoverable mesos upgrades / configuration changes,
>>> etc).
>>>
>>> The desire is to expose maintenance events to frameworks in a generic
>>> manner, as to allow frameworks to respect their SLAs, perform better task
>>> placement, and migrate tasks if necessary.
>>>
>>> The design document is here:
>>>
>>> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>>>
>>> Please take a moment before the end of next week to go over this design. *Higher
>>> level feedback and questions can be discussed most effectively in this
>>> thread.*
>>>
>>> Let's thank Alexandra for her work!
>>>
>>> Ben
>>>
>>
>


Re: Design Review: Maintenance Primitives

Posted by Benjamin Mahler <be...@gmail.com>.
You're right; we don't account for that in the current design, because
such a framework would be relying on disk resources outside of the
sandbox. Currently, we don't have a model for these "persistent"
resources (e.g. a disk volume used for HDFS DataNode data). Unlike the
existing resources, persistent resources will not be tied to the
lifecycle of the executor/task.

When we have a model for persistent resources, I can see this fitting into
the primitives we are proposing here. Since inverse offers work at the
resource level, we can provide control to the operators to determine
whether the persistent resources should be reclaimed from the framework as
part of the maintenance:

E.g. If decommissioning a machine, the operator can ensure that all
persistent resources are reclaimed. If rebooting a machine, the operator
can leave these resources allocated to the framework for when the machine
is back in the cluster.

Now, since we have the soft deadlines on inverse offers, a framework like
HDFS can determine when it can comply with inverse offers based on the
global data replication state (e.g. always ensure that 2/3 replicas of a
block are available). If relinquishing a particular data volume would
mean that only 1 copy of a block is available, the framework can wait to
comply with the inverse offer, or can take steps to create more replicas.
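
As a rough illustration of that decision (the types and the replica
threshold below are assumptions for the sake of the example, not part of
the design):

    // Illustrative only: how an HDFS-like framework might decide whether
    // it can comply with an inverse offer for one node, based on the
    // cluster-wide replication state.
    #include <vector>

    struct BlockReplication {
      int liveReplicas;    // replicas currently available cluster-wide
      bool hasCopyOnNode;  // does the node being drained hold one of them?
    };

    // Returns true if releasing the node still leaves every affected block
    // with at least `minReplicas` copies; otherwise the framework should
    // defer and re-replicate before complying with the inverse offer.
    bool canComplyWithInverseOffer(const std::vector<BlockReplication>& blocks,
                                   int minReplicas = 2) {
      for (const BlockReplication& block : blocks) {
        if (!block.hasCopyOnNode) {
          continue;  // unaffected by draining this node
        }
        if (block.liveReplicas - 1 < minReplicas) {
          return false;  // would drop below the safety threshold
        }
      }
      return true;
    }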

One interesting question is how the resource expiry time will interact
with persistent resources; we may want to expose the expiry time at the
resource level rather than the offer level. Will think about this.

*However could you specify that when you drain a slave with hard:false you
> don't enter the drained state even when the deadline has passed if tasks
> are still running? This is not explicit in the document and we want to make
> sure operators have the information about this and could avoid unfortunate
> rolling restarts.*


This is explicit in the document under the soft deadline section: the
inverse offer will remain outstanding after the soft deadline elapses, we
won't forcibly drain the task. Anything that's not clear here?
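
Stated as a small sketch, under the semantics described in the document
(the names here are invented for illustration and are not the actual
master implementation):

    // With hard == false, a slave only reaches the "drained" state once no
    // tasks remain, even after the deadline elapses; the inverse offer
    // simply stays outstanding. With hard == true, tasks are killed at the
    // deadline, so the slave drains regardless.
    #include <chrono>

    enum class DrainState { Draining, Drained };

    struct SlaveDrain {
      bool hardDeadline;                               // hard:true vs hard:false
      std::chrono::system_clock::time_point deadline;  // drain deadline
      int runningTasks;                                // tasks still on the slave
    };

    DrainState evaluate(const SlaveDrain& slave,
                        std::chrono::system_clock::time_point now) {
      if (slave.runningTasks == 0) {
        return DrainState::Drained;
      }
      if (slave.hardDeadline && now >= slave.deadline) {
        // Hard deadline: remaining tasks are forcibly killed, so the slave
        // ends up drained anyway.
        return DrainState::Drained;
      }
      // Soft deadline, tasks still running: keep draining, never force it.
      return DrainState::Draining;
    }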




On Mon, Aug 25, 2014 at 1:08 PM, Maxime Brugidou <ma...@gmail.com>
wrote:

> Nice work!
>
> First question: don't you think that operations should differentiate short
> and long maintenance?
> I am thinking about frameworks that use persistent storage on disk for
> example. A short maintenance such as a slave reboot or upgrade could be
> done without moving the data to another slave. However decommissioning
> requires to drain the storage too.
>
> If you have an HDFS datanode with 50TB of (replicated) data, you might not
> want to drain it for a reboot (assuming your replication factor is high
> enough) since it takes ages. However for decommission it might make sense
> to drain it.
>
> Not sure if this is a good example but I feel the need to know if the
> maintenance is planned to be short or is forever. I know this does not fit
> the nice modeling you describe :-/
>
> Actually for HDFS we could define a threshold where "good enough"
> replication without the slave would be considered enough and thus we could
> deactivate the slave. This would prevent a rolling restart to go too fast.
> However could you specify that when you drain a slave with hard:false you
> don't enter the drained state even when the deadline has passed if tasks
> are still running? This is not explicit in the document and we want to make
> sure operators have the information about this and could avoid unfortunate
> rolling restarts.
>  On Aug 25, 2014 9:25 PM, "Benjamin Mahler" <be...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I wanted to take a moment to thank Alexandra Sava, who completed her OPW
>> internship this past week. We worked together in the second half of her
>> internship to create a design document for maintenance primitives in Mesos
>> (the original ticket is MESOS-1474
>> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
>> document is the most up-to-date plan).
>>
>> Maintenance in this context consists of anything that requires the tasks
>> running on the slave to be killed (e.g. kernel upgrades, machine
>> decommissioning, non-recoverable mesos upgrades / configuration changes,
>> etc).
>>
>> The desire is to expose maintenance events to frameworks in a generic
>> manner, as to allow frameworks to respect their SLAs, perform better task
>> placement, and migrate tasks if necessary.
>>
>> The design document is here:
>>
>> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>>
>> Please take a moment before the end of next week to go over this design. *Higher
>> level feedback and questions can be discussed most effectively in this
>> thread.*
>>
>> Let's thank Alexandra for her work!
>>
>> Ben
>>
>


Re: Design Review: Maintenance Primitives

Posted by Maxime Brugidou <ma...@gmail.com>.
Nice work!

First question: don't you think that operations should differentiate
between short and long maintenance?
I am thinking about frameworks that use persistent storage on disk, for
example. A short maintenance such as a slave reboot or upgrade could be
done without moving the data to another slave. However, decommissioning
requires draining the storage too.

If you have an HDFS datanode with 50TB of (replicated) data, you might
not want to drain it for a reboot (assuming your replication factor is
high enough), since it takes ages. However, for decommissioning it might
make sense to drain it.

Not sure if this is a good example, but I feel the need to know whether
the maintenance is planned to be short or is permanent. I know this does
not fit the nice modeling you describe :-/

Actually, for HDFS we could define a threshold where "good enough"
replication without the slave would be considered sufficient, and thus
we could deactivate the slave. This would prevent a rolling restart from
going too fast. However, could you specify that when you drain a slave
with hard:false you don't enter the drained state even when the deadline
has passed, if tasks are still running? This is not explicit in the
document, and we want to make sure operators have this information and
can avoid unfortunate rolling restarts.
On Aug 25, 2014 9:25 PM, "Benjamin Mahler" <be...@gmail.com>
wrote:

> Hi all,
>
> I wanted to take a moment to thank Alexandra Sava, who completed her OPW
> internship this past week. We worked together in the second half of her
> internship to create a design document for maintenance primitives in Mesos
> (the original ticket is MESOS-1474
> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
> document is the most up-to-date plan).
>
> Maintenance in this context consists of anything that requires the tasks
> running on the slave to be killed (e.g. kernel upgrades, machine
> decommissioning, non-recoverable mesos upgrades / configuration changes,
> etc).
>
> The desire is to expose maintenance events to frameworks in a generic
> manner, as to allow frameworks to respect their SLAs, perform better task
> placement, and migrate tasks if necessary.
>
> The design document is here:
>
> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>
> Please take a moment before the end of next week to go over this design. *Higher
> level feedback and questions can be discussed most effectively in this
> thread.*
>
> Let's thank Alexandra for her work!
>
> Ben
>

Re: Design Review: Maintenance Primitives

Posted by Benjamin Mahler <be...@gmail.com>.
Now that persistent resources need to be considered, we revisited the
maintenance design to ensure persistent frameworks were accounted for.
In particular, in the updated design we allow operators to specify a
conservative estimate of the unavailability, which is useful for
persistent frameworks. There is no longer a split between the planned
schedule and the actual draining, which is also useful for persistent
frameworks.

The updated high level design is here:
https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing
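
As a rough illustration of the kind of information an operator would
supply per machine under the updated design (the field names and shape
here are my assumptions, not the final API):

    // Illustrative sketch of an operator-supplied unavailability estimate
    // per machine; frameworks holding persistent resources can use the
    // (conservative) estimate to decide whether to migrate data or wait.
    #include <chrono>
    #include <string>
    #include <vector>

    struct Unavailability {
      std::chrono::system_clock::time_point start;  // when maintenance begins
      std::chrono::seconds estimatedDuration;       // conservative; may be exceeded
    };

    struct MachineMaintenance {
      std::string hostname;
      Unavailability window;
    };

    // The operator posts one window per machine ahead of time.
    using MaintenanceSchedule = std::vector<MachineMaintenance>;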

On Mon, Aug 25, 2014 at 12:24 PM, Benjamin Mahler <benjamin.mahler@gmail.com
> wrote:

> Hi all,
>
> I wanted to take a moment to thank Alexandra Sava, who completed her OPW
> internship this past week. We worked together in the second half of her
> internship to create a design document for maintenance primitives in Mesos
> (the original ticket is MESOS-1474
> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
> document is the most up-to-date plan).
>
> Maintenance in this context consists of anything that requires the tasks
> running on the slave to be killed (e.g. kernel upgrades, machine
> decommissioning, non-recoverable mesos upgrades / configuration changes,
> etc).
>
> The desire is to expose maintenance events to frameworks in a generic
> manner, as to allow frameworks to respect their SLAs, perform better task
> placement, and migrate tasks if necessary.
>
> The design document is here:
>
> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>
> Please take a moment before the end of next week to go over this design. *Higher
> level feedback and questions can be discussed most effectively in this
> thread.*
>
> Let's thank Alexandra for her work!
>
> Ben
>
