Posted to dev@mesos.apache.org by "Heer, Christoph" <ch...@sap.com> on 2019/06/07 19:24:19 UTC

Re: Design doc: Agent draining and deprecation of maintenance primitives

Hi everyone,

my team and I implemented our own Mesos framework for task execution on our bare-metal on-prem cluster.
Especially for task-processing workloads with known or estimated task durations, the available Mesos maintenance primitives are super powerful for schedulers and operators. While developing the scheduler, I never felt it was complex to support and respect maintenance windows. Even the small check "Should I launch task X with an estimated runtime of 3h on node Y with scheduled maintenance in 40min?" saved us tons of aborted tasks. Our hardware operations team also really likes being able to plan and express maintenance windows upfront. Days before the actual maintenance they can add the information, and the node will be ready at that point in time. They can also reboot the machines without fear that any production workload will be scheduled before they confirm the end of the maintenance. It looks like this last point would also be ensured by the new design.
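The small check described above can be sketched as follows (a minimal illustration, not our actual scheduler code; all names are made up):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def fits_before_maintenance(estimated_runtime: timedelta,
                            maintenance_start: Optional[datetime],
                            safety_margin: timedelta = timedelta(minutes=5),
                            now: Optional[datetime] = None) -> bool:
    """True if the task is expected to finish (plus a safety margin)
    before the node's next scheduled maintenance window begins."""
    if maintenance_start is None:  # no maintenance planned on this node
        return True
    now = now or datetime.now(timezone.utc)
    return now + estimated_runtime + safety_margin <= maintenance_start

# "Should I launch task X (estimated runtime 3h) on node Y
#  with scheduled maintenance in 40 min?" -> False
now = datetime.now(timezone.utc)
print(fits_before_maintenance(timedelta(hours=3),
                              now + timedelta(minutes=40), now=now))
```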

In the past we used another job orchestration system with a draining approach similar to the design proposal. In nearly all cases the operations team didn't manage to start the draining mode at the right time: either it was too early and we wasted available hardware resources, or it was too late and it unnecessarily interrupted production workloads. Especially for long-running tasks that are expensive to restart, it wasn't a good way to manage scheduled downtimes.

I don't know the implementation within Mesos and therefore can't judge the complexity, but I think the main problem is that Mesos doesn't provide an intuitive interface for managing maintenance windows. The HTTP API isn't that complicated, but you definitely need your own or external tooling. Most people are probably already deterred by the JSON syntax with nanosecond timestamps. Also, the lack of synchronisation of modifications can be a problem and makes it harder to implement tooling around the API. A new, more fine-grained HTTP API would be a big improvement and would make it possible to implement a nice-looking interface within the Mesos UI.
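For illustration, this is roughly the nanosecond-based JSON that gets POSTed to the master's /maintenance/schedule endpoint, built with a small helper (a sketch; the field names follow the documented maintenance API, but treat the details and hostnames as assumptions):

```python
import json
import time

def maintenance_window(hostname: str, ip: str,
                       start_secs: float, duration_secs: float) -> dict:
    """Build one window entry for the master's /maintenance/schedule
    endpoint. Times must be expressed in nanoseconds, which is the part
    of the JSON syntax that tends to deter people."""
    return {
        "machine_ids": [{"hostname": hostname, "ip": ip}],
        "unavailability": {
            "start": {"nanoseconds": int(start_secs * 1e9)},
            "duration": {"nanoseconds": int(duration_secs * 1e9)},
        },
    }

schedule = {"windows": [
    # Take node-y down starting in one hour, for a two-hour window.
    maintenance_window("node-y", "10.0.0.7",
                       start_secs=time.time() + 3600,
                       duration_secs=7200),
]}
print(json.dumps(schedule, indent=2))
# POST this body to http://<master>:5050/maintenance/schedule
```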

It would be sad to see this great feature disappearing.

Best regards,
Christoph


Christoph Heer
SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany

Mandatory Disclosure Statement: www.sap.com/impressum
This e-mail may contain trade secrets or privileged, undisclosed, or otherwise 
confidential information. If you have received this e-mail in error, you are hereby 
notified that any review, copying, or distribution of it is strictly prohibited. Please inform 
us immediately and destroy the original transmittal. Thank you for your cooperation.


> On 7. Jun 2019, at 09:56, Maxime Brugidou <ma...@gmail.com> wrote:
> 
> I think you are both correct that most users don't and won't use the schedules to plan maintenance in advance. The main reason is that frameworks just don't use this schedule and don't take inverse offers into account, but also most use cases are OK with simply draining nodes one after the other without more logic.
> 
> In the end, Benjamin is right that we keep hitting the same problem with Mesos: there is no good reference implementation of a scheduler with all the features baked in. On our side we mostly use open-source schedulers (Marathon, Aurora, Flink, etc.) for various use cases, and they mostly don't leverage maintenance primitives. We started to use them for one custom use case where we are indeed building our own framework, and we want to provide some sort of "task duration" SLA which would clearly benefit from maintenance schedules. Honestly, if we are the only users doing that, we can easily maintain the schedules in a separate service. I haven't seen any framework actually using the inverse offers, though.
> 
> I also agree that adding offer ordering in the allocator is probably not the best design, since what we want in the end is probably some affinity/anti-affinity at the scheduler level based on the "time to reboot", for example. But again, this needs cooperation from frameworks. My idea was more of a hack/prototype idea, since I see that slaves are randomly sorted in the allocator and we could easily patch it to have a custom sort mechanism. But I completely agree that optimistic offers or similar techniques are the way to go.
> 
> I don't think we will ever get to the point of having a reference scheduler: the Mesos community would need to agree on one implementation and make sure that every new feature of Mesos gets implemented in the scheduler. That is a huge amount of work and coordination/design. The Mesosphere dcos-commons library is one example of the complexity of such a project: it is dedicated to stateful services, is clearly coupled with DC/OS (although we are able to use it on bare Mesos too), and it's still difficult to use. However, having an open-source scheduler exposing a higher-level, friendly API via RPC (like Kubernetes, for example) is probably the only way to make Mesos more accessible for most users.
> 
> On Fri, Jun 7, 2019 at 6:24 AM Benjamin Mahler <bm...@apache.org> wrote:
> > With the new proposal, it's going to be as difficult as before to have SLA-aware maintenances because it will need cooperation from the frameworks anyway and we know this is rarely a priority for them. We will also lose the ability to signal future maintenance in order to optimize allocations.
> 
> Personally, I think right now we should solve the basic need of draining a node. The plan to add SLA-awareness into draining was to introduce a capability that schedulers opt into, which enables them to (1) take control over the killing of tasks when an agent is put into the draining state, and (2) still get offers when an agent is in the draining state, in case the scheduler needs to restart a task that *must* run. This allows an SLA-aware scheduler to avoid killing during a drain if its task(s) would have SLAs violated.
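A minimal sketch of this opt-in behaviour from the scheduler's point of view (the types and fields here are hypothetical illustrations, not part of the Mesos API):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RunningTask:
    task_id: str
    sla_deadline: datetime  # earliest moment this task may be interrupted

def partition_for_drain(tasks, now):
    """Split the tasks on a draining agent into those whose SLA already
    permits killing, and those the scheduler should defer (keeping them
    running, or restarting them on the draining agent if they *must*
    run) until their deadline passes."""
    kill = [t for t in tasks if t.sla_deadline <= now]
    defer = [t for t in tasks if t.sla_deadline > now]
    return kill, defer
```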
> 
> Perhaps this functionality can live alongside the maintenance schedule information we currently support, without being coupled together. As far as I'm aware that's something we hadn't considered (we considered integrating into the maintenance schedules or replacing them).
> 
> > For example I had this idea to improve the allocator (or write a custom one) that would offer resources from agents with no maintenance planned in priority, and then sort agents by maintenance date in decreasing order.
> 
> Right now there is no meaning to the order of offers. Adding meaning to the ordering of offers quickly becomes an issue as soon as there are multiple criteria to evaluate. For example, if you want to incorporate maintenance, load spreading, fault-domain spreading, etc. across machines, it becomes less clear how offers should be ordered. One could try to build some scoring model in Mesos for ordering, but it would be woefully inadequate, since Mesos does not know anything about the pending workloads: it's ultimately the schedulers that are best positioned to make these decisions. This is why we are going to move towards an "optimistic concurrency" model where schedulers can choose what they want and Mesos enforces constraints (e.g. quota limits), thereby eliminating the multi-scheduler scalability issues of the current offer model.
> 
> And as somewhat of an aside, the lack of built-in scheduling has been bad for the Mesos ecosystem. The vast majority of users just need to schedule: services, jobs, and cron jobs. These have a pretty standard look and feel (including the SLA aspect of them!). Many of the existing schedulers could be thinner "orchestrators" that know when to submit something to be scheduled by a common scheduler, rather than reimplementing all of the typical scheduling primitives (constraints, SLA awareness, dealing with the low-level Mesos scheduling API). My point here is that we ask too much of frameworks and it hurts users. I would love to see scheduling become more standardized and built into Mesos.
> 
> On Thu, Jun 6, 2019 at 10:52 AM Greg Mann <gr...@mesosphere.io> wrote:
> Maxime,
> Thanks for the feedback, it's much appreciated. I agree that it would be possible to evolve the existing primitives to accomplish something similar to the proposal. That is one option that was considered before writing the design doc, but after some discussion, it seemed more appropriate to start over with a simpler model that accomplishes what we perceive to be the predominant use case: the automated draining of agent nodes, without the concept of a maintenance window or designated maintenance time in the future. However, perhaps this perception is incorrect?
> 
> Using maintenance metadata to alter the sorting order in the allocator is an interesting idea; currently, the allocator does not have access to information about maintenance, but it's conceivable that we could extend the allocator interface to accommodate this. While the currently-proposed design would not allow this, it would allow operators to deactivate nodes, which is an extreme version of the same idea, since deactivated agents would never have their resources offered to frameworks. This provides a blunt mechanism to prevent scheduling on nodes which have upcoming maintenance, although it sounds like you see some benefit to a more subtle notion of scheduling priority based on upcoming maintenance. Do you think that maintenance-aware sorting would provide much more benefit to you over agent deactivation? Do you make use of the existing maintenance primitives to signal upcoming maintenance on agents?
> 
> Thanks!
> Greg
> 
> On Thu, Jun 6, 2019 at 9:37 AM Maxime Brugidou <ma...@gmail.com> wrote:
> Hi,
> 
> As a Mesos operator, I am really surprised by this proposal.
> 
> The main advantage of the proposed design is that we can finally take nodes down for maintenance with a configurable kill grace period and a proper task status (with the maintenance primitives it was TASK_LOST, I think), without any specific cooperation from the frameworks.
> 
> I think that this could be just an evolution of the current primitives.
> 
> With the new proposal, it's going to be as difficult as before to have SLA-aware maintenances because it will need cooperation from the frameworks anyway and we know this is rarely a priority for them. We will also lose the ability to signal future maintenance in order to optimize allocations.
> 
> For example, I had this idea to improve the allocator (or write a custom one) so that it would offer resources from agents with no planned maintenance first, and then sort the remaining agents by maintenance date in decreasing order. This would be a big improvement to prevent cluster reboots from triggering too many task restarts. This will not be possible with the new primitives. The same idea applies to frameworks too.
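Such a sort could be sketched like this (illustrative only; the real allocator is C++ and this ignores every other allocation concern):

```python
from datetime import datetime
from typing import Dict, List, Optional

def allocation_order(maintenance: Dict[str, Optional[datetime]]) -> List[str]:
    """Order agents for offering: agents with no planned maintenance come
    first, then agents sorted by maintenance date, farthest in the future
    first, so long workloads land where interruption is least likely."""
    def sort_key(item):
        agent, when = item
        # Group 0: no maintenance planned. Group 1: negate the timestamp
        # so that later maintenance dates sort earlier (descending order).
        return (0, 0.0, agent) if when is None else (1, -when.timestamp(), agent)
    return [agent for agent, _ in sorted(maintenance.items(), key=sort_key)]

print(allocation_order({
    "agent-a": datetime(2019, 6, 10),
    "agent-b": None,
    "agent-c": datetime(2019, 7, 1),
}))
# -> ['agent-b', 'agent-c', 'agent-a']
```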
> 
> Maxime
> 
> Le jeu. 30 mai 2019 à 22:16, Joseph Wu <jo...@mesosphere.io> a écrit :
> As far as I can tell, the document is public.
> 
> On Thu, May 30, 2019 at 12:22 AM Marc Roos <M....@f1-outsourcing.eu> wrote:
>  
> Is the doc not public?
> 
> 
> -----Original Message-----
> From: Joseph Wu [mailto:joseph@mesosphere.io] 
> Sent: donderdag 30 mei 2019 2:07
> To: dev; user
> Subject: Design doc: Agent draining and deprecation of maintenance 
> primitives
> 
> Hi all,
> 
> A few years back, we added some constructs called maintenance primitives 
> to Mesos.  This feature was meant to allow operators and frameworks to 
> cooperate in draining tasks off nodes scheduled for maintenance.  As far 
> as we've observed since, this feature never achieved enough adoption to 
> be useful for operators.
> 
> As such, we are proposing a more opinionated approach for draining 
> tasks.  The goal is to have Mesos perform draining in lieu of 
> frameworks, minimizing or eliminating the need to change frameworks to 
> account for draining.  We will also be simplifying the operator 
> workflow, which would only require a single call (holding an AgentID) to 
> start draining; and a single call to bring an agent back into the 
> cluster.
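For illustration, the two single calls described above could carry bodies like the following, in the style of the v1 operator API (a sketch: the call names and message shapes are an assumption about what the design might standardize on, not confirmed API):

```python
import json

def drain_agent_call(agent_id: str) -> str:
    """A single call holding an AgentID to start draining an agent."""
    return json.dumps({
        "type": "DRAIN_AGENT",
        "drain_agent": {"agent_id": {"value": agent_id}},
    })

def reactivate_agent_call(agent_id: str) -> str:
    """A single call to bring the agent back into the cluster."""
    return json.dumps({
        "type": "REACTIVATE_AGENT",
        "reactivate_agent": {"agent_id": {"value": agent_id}},
    })

# Either body would be POSTed to the master's /api/v1 endpoint.
print(drain_agent_call("0915eab0-f2a3-4c21-a387-S7"))
```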
> 
> 
> Due to how closely this proposed feature overlaps with maintenance 
> primitives, we will be deprecating maintenance primitives upon 
> implementation of agent draining.
> 
> 
> If interested, please take a look at the design document:
> 
> https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
> 
> 


Re: Design doc: Agent draining and deprecation of maintenance primitives

Posted by Vinod Kone <vi...@gmail.com>.
+1

Thanks,
Vinod

> On Jun 14, 2019, at 9:18 AM, Greg Mann <gr...@mesosphere.io> wrote:
> 
> Hi all,
> A few other committers and I spent some time revisiting the possibility of implementing agent draining using maintenance windows, as well as discussing the coexistence of the existing maintenance primitives with the agent draining feature as it is currently designed. Ultimately, the use case of an operator putting an agent into a draining state immediately and indefinitely, with no concept of a maintenance window, seems to be valid. That use case is a bit awkward to represent in terms of our existing maintenance windows. So, our thought is that we can add the agent draining feature as it is currently designed, in order to provide an automatic agent-draining primitive. We can then extend the maintenance schedules later to allow operators to specify that they would like agents drained automatically leading up to the maintenance window. At that point, we could use the agent draining primitive to accomplish this.
> 
> For the time being, we would like to disallow any single agent from both being present in the maintenance schedule and being put into an automatic draining state. This gives us some time to figure out precisely how these two features will interact so that we avoid the need to make breaking changes down the road.
> 
> Let me know what you all think of the above plan. I like it because it allows operators who are currently using the maintenance primitives to continue doing so, accommodates the simple case of immediate agent draining in the near future, and allows us to incorporate automatic draining into the maintenance schedule later.
> 
> Cheers,
> Greg
> 
>> On Fri, Jun 14, 2019 at 4:18 PM Greg Mann <gr...@mesosphere.io> wrote:
>> Christoph,
>> Great to hear that you're using the maintenance primitives! It seems unwise for us to deprecate this part of the API given the fact that you and Maxime have both expressed a desire for it to stick around. I'll adjust the agent draining design doc to remove the deprecation of that feature. Many thanks for your feedback.
>> 
>> Greg
>> 

Re: Design doc: Agent draining and deprecation of maintenance primitives

Posted by Greg Mann <gr...@mesosphere.io>.
Hi all,
Myself and a few other committers spent some time revisiting the
possibility of implementing agent draining using maintenance windows, as
well as discussing the coexistence of the existing maintenance primitives
with the agent draining feature as it is currently designed. Ultimately,
the use case of an operator putting an agent into a draining state
immediately and indefinitely, with no concept of a maintenance window,
seems to be valid. That use case is a bit awkward to represent in terms of
our existing maintenance windows. So, our thought is that we can add the
agent draining feature as it is currently designed, in order to provide an
automatic agent draining primitive. We can then later on extend the
maintenance schedules to allow operators to specify that they would like to
automatically drain agents leading up to the maintenance window. At that
point, we could make use of the agent draining primitive to accomplish this.

For the time being, we would like to disallow any single agent from both
being present in the maintenance schedule and being put into an automatic
draining state. This gives us some time to figure out precisely how these
two features will interact so that we avoid the need to make breaking
changes down the road.

Let me know what you all think of the above plan. I like it because it
allows operators who are currently using the maintenance primitives to
continue doing so, accommodates the simple case of immediate agent draining
in the near future, and allows us to incorporate automatic draining into
the maintenance schedule later.

Cheers,
Greg

On Fri, Jun 14, 2019 at 4:18 PM Greg Mann <gr...@mesosphere.io> wrote:

> Christoph,
> Great to hear that you're using the maintenance primitives! It seems
> unwise for us to deprecate this part of the API given the fact that you and
> Maxime have both expressed a desire for it to stick around. I'll adjust the
> agent draining design doc to remove the deprecation of that feature. Many
> thanks for your feedback.
>
> Greg
>
> On Fri, Jun 7, 2019 at 9:24 PM Heer, Christoph <ch...@sap.com>
> wrote:
>
>> Hi everyone,
>>
>> my team and I implemented our own Mesos framework for task execution on
>> our bare-metal on-prem cluster.
>> Especially for task processing workload with known or estimated task
>> duration, the available Mesos maintenance primitives are super powerful for
>> scheduler and operators. While developing the scheduler, I hadn't the
>> feeling it would be complex to support/respect maintenance windows. Already
>> the small logic "Should I launch task X with estimated runtime 3h on node Y
>> with scheduled maintenance in 40min?" saved us tons of aborted tasks. Our
>> hardware operations team also really likes the way to plan and express
>> maintenance windows upfront. Days before the actually maintenance they can
>> add the information and the node will be ready at that point in time. Also,
>> they can reboot the machines without the fear that any production workload
>> will be scheduled until they confirmed the end of the maintenance. But
>> looks like this would be also ensured by the new design.
>>
>> In the past we already used another job orchestration system with a
>> draining approach similar to the design proposal. In nearly all cases the
>> operations team didn't manage to start the draining mode at the right time.
>> Either it was too early, and we didn't use available hardware resources or
>> it was too late and it unnecessarily interrupted productive workload.
>> Especially for long-running tasks which are expensive at restarting, it
>> wasn't a good way to mange scheduled down times.
>>
>> I don't know the implementation within Mesos and therefore can't judge
>> the complexity, but I think the main problem is that Mesos doesn't
>> provide an intuitive interface for managing maintenance windows. The HTTP
>> API isn't that complicated, but you definitely need your own or external
>> tooling. Probably most people are already deterred by the JSON syntax with
>> nanoseconds. Also, the lack of synchronisation of modifications can be a
>> problem and makes it harder to implement tooling around the API. A new,
>> more fine-grained HTTP API would be a big improvement and would allow a
>> nice-looking interface to be implemented within the Mesos UI.
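The "JSON syntax with nanoseconds" refers to the payload of the master's `/maintenance/schedule` endpoint. A small helper like the one below is the kind of tooling people end up writing. The schema is reproduced from memory of the Mesos maintenance documentation, and the hostname/IP values are made up, so verify against your Mesos version before relying on it.

```python
import json
from datetime import datetime, timedelta, timezone

def maintenance_schedule(hostname, ip, start, duration):
    """Build the JSON body for POST /master/maintenance/schedule.

    Mesos expects absolute times and durations as integer nanoseconds,
    which is what makes hand-writing this payload so error-prone.
    """
    return json.dumps({
        "windows": [{
            "machine_ids": [{"hostname": hostname, "ip": ip}],
            "unavailability": {
                "start": {"nanoseconds": int(start.timestamp() * 1e9)},
                "duration": {"nanoseconds": int(duration.total_seconds() * 1e9)},
            },
        }]
    })

# Example: a 4-hour window starting 2019-06-10 02:00 UTC.
body = maintenance_schedule(
    "node-y.example.com", "10.0.0.42",
    start=datetime(2019, 6, 10, 2, 0, tzinfo=timezone.utc),
    duration=timedelta(hours=4))
```

Letting operators write human-readable times and converting to nanoseconds in one place removes most of the friction complained about here.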
>>
>> It would be sad to see this great feature disappear.
>>
>> Best regards,
>> Christoph
>>
>>
>> Christoph Heer
>> SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany
>>
>> Mandatory Disclosure Statement: www.sap.com/impressum
>> This e-mail may contain trade secrets or privileged, undisclosed, or
>> otherwise
>> confidential information. If you have received this e-mail in error, you
>> are hereby
>> notified that any review, copying, or distribution of it is strictly
>> prohibited. Please inform
>> us immediately and destroy the original transmittal. Thank you for your
>> cooperation.
>>
>>
>> > On 7. Jun 2019, at 09:56, Maxime Brugidou <ma...@gmail.com>
>> wrote:
>> >
>> > I think that you are both correct about the fact that most users don't
>> and won't use the schedules to plan maintenance in advance. The main reason
>> is that frameworks just don't use this schedule and don't take inverse
>> offers into account, but also most use cases are OK with simply draining
>> nodes one after the other without more logic.
>> >
>> > In the end Benjamin is right: we are always hitting the same problem
>> with Mesos, there is no good reference implementation of a scheduler, with
>> all the features baked in. On our side we are mostly using open source
>> schedulers (Marathon, Aurora, Flink...etc) for various use cases and they
>> mostly don't leverage maintenance primitives. We started to use them for
>> one custom use case where we are indeed building our own framework, and we
>> want to provide some sort of "task duration" SLA which would clearly
>> benefit from maintenance schedules. Honestly if we are the only users doing
>> that, we can maintain the schedules on a separate service easily. I haven't
>> seen any framework actually using the inverse offers though.
>> >
>> > I also agree that adding offer ordering in the allocator is probably
>> not the best design since what we want in the end is probably some
>> affinity/anti-affinity at the scheduler level based on the "time to reboot"
>> for example. But again, this needs cooperation from frameworks. My idea was
>> more of a hack/prototype idea since I see that slaves are randomly sorted
>> in the allocator and we could easily patch it to have a custom sort
>> mechanism. But I completely agree that optimistic offers or similar
>> techniques are the way to go.
>> >
>> > I don't think that we will ever get to the point of having a reference
>> scheduler, the Mesos community would need to agree on one implementation
>> and make sure that every new feature of Mesos gets implemented in the
>> scheduler. This is a huge amount of work and coordination/design. The
>> mesosphere dcos-commons library is one example of the complexity of such a
>> project, it is dedicated to stateful services, is clearly coupled with
>> DC/OS (although we are able to use it on bare Mesos too), and it's still
>> difficult to use. However, having an open source scheduler exposing a
>> higher-level friendly API via RPC (like kubernetes for example), is
>> probably the only way to make Mesos more accessible for most users.
>> >
>> > On Fri, Jun 7, 2019 at 6:24 AM Benjamin Mahler <bm...@apache.org>
>> wrote:
>> > > With the new proposal, it's going to be as difficult as before to
>> have SLA-aware maintenances because it will need cooperation from the
>> frameworks anyway and we know this is rarely a priority for them. We will
>> also lose the ability to signal future maintenance in order to optimize
>> allocations.
>> >
>> > Personally, I think right now we should solve the basic need of
>> draining a node. The plan to add SLA-awareness into draining was to
>> introduce a capability that schedulers opt into that enables them to (1)
>> take control over the killing of tasks when an agent is put into the
>> draining state and (2) still get offers when an agent is in the draining
>> state in case the scheduler needs to restart a task that *must* run. This allows
>> an SLA-aware scheduler to avoid killing during a drain if its task(s) will
>> have SLAs violated.
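A hypothetical sketch of the opt-in behaviour described here: a scheduler holding the proposed capability decides itself which tasks on a draining agent can be killed immediately and which must keep running until their SLA is met. All names, the drain signal, and the SLA criterion are invented for illustration; the actual API was only a design proposal at this point.

```python
from datetime import datetime, timedelta, timezone

def on_agent_draining(tasks_on_agent, now, min_runtime=timedelta(hours=1)):
    """Return (task ids to kill now, task ids to keep until SLA is met).

    Here the 'SLA' is a toy rule: a task may be killed once it has run
    for at least min_runtime.
    """
    kill, keep = [], []
    for task in tasks_on_agent:
        met_sla = now - task["started"] >= min_runtime
        (kill if met_sla else keep).append(task["id"])
    return kill, keep

now = datetime(2019, 6, 7, 12, 0, tzinfo=timezone.utc)
tasks = [
    {"id": "t1", "started": now - timedelta(hours=2)},     # SLA met, safe to kill
    {"id": "t2", "started": now - timedelta(minutes=10)},  # keep running for now
]
print(on_agent_draining(tasks, now))  # (['t1'], ['t2'])
```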
>> >
>> > Perhaps this functionality can live alongside the maintenance schedule
>> information we currently support, without being coupled together. As far as
>> I'm aware that's something we hadn't considered (we considered integrating
>> into the maintenance schedules or replacing them).
>> >
>> > > For example I had this idea to improve the allocator (or write a
>> custom one) that would offer resources from agents with no maintenance
>> planned in priority, and then sort agents by maintenance date in
>> decreasing order.
>> >
>> > Right now there is no meaning to the order of offers. Adding some
>> meaning to the ordering of offers quickly becomes an issue for us as soon
>> as there are multiple criteria that need to be evaluated. For example, if
>> you want to incorporate maintenance, load spreading, fault domain
>> spreading, etc across machines, it becomes less clear how offers should be
>> ordered. One could try to build some scoring model in Mesos for ordering,
>> but it will be woefully inadequate since Mesos does not know anything about
>> the pending workloads: it's ultimately the schedulers that are best
>> positioned to make these decisions. This is why we are going to move
>> towards an "optimistic concurrency" model where schedulers can choose what
>> they want and Mesos enforces constraints (e.g. quota limits), thereby
>> eliminating the multi-scheduler scalability issues of the current offer
>> model.
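For reference, the agent ordering Maxime proposes (quoted above) amounts to a sort key: agents with no planned maintenance first, then agents whose maintenance is farthest in the future. This is an illustration only; the Mesos allocator exposes no such hook, and the `Agent` record here is made up.

```python
from collections import namedtuple

# maintenance_start is epoch seconds, or None if no maintenance is planned.
Agent = namedtuple("Agent", ["id", "maintenance_start"])

def offer_order(agents):
    return sorted(
        agents,
        key=lambda a: (a.maintenance_start is not None,  # no-maintenance agents first
                       -(a.maintenance_start or 0)))     # then farthest-out maintenance

agents = [
    Agent("a1", 1_560_000_000),  # maintenance soon
    Agent("a2", None),           # no maintenance planned
    Agent("a3", 1_570_000_000),  # maintenance far in the future
]
print([a.id for a in offer_order(agents)])  # ['a2', 'a3', 'a1']
```

As the paragraph above argues, a single sort key like this breaks down once several criteria (maintenance, load spreading, fault domains) must be combined, which is the case for moving such decisions into schedulers.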
>> >
>> > And as somewhat of an aside, the lack of built-in scheduling has been
>> bad for the Mesos ecosystem. The vast majority of users just need to
>> schedule: services, jobs and cron jobs. These have a pretty standard look
>> and feel (including the SLA aspect of them!). Many of the existing
>> schedulers could be thinner "orchestrators" that know when to submit
>> something to be scheduled by a common scheduler, rather than reimplementing
>> all of the typical scheduling primitives (constraints, SLA awareness,
>> dealing with the low-level Mesos scheduling API). My point here is that we
>> ask too much of frameworks and it hurts users. I would love to see
>> scheduling become more standardized and built into Mesos.
>> >
>> > On Thu, Jun 6, 2019 at 10:52 AM Greg Mann <gr...@mesosphere.io> wrote:
>> > Maxime,
>> > Thanks for the feedback, it's much appreciated. I agree that it would
>> be possible to evolve the existing primitives to accomplish something
>> similar to the proposal. That is one option that was considered before
>> writing the design doc, but after some discussion, I thought that it seems
>> more appropriate to start over with a simpler model that accomplishes what
>> we perceive to be the predominant use case: the automated draining of agent
>> nodes, without the concept of a maintenance window or designated
>> maintenance time in the future. However, perhaps this perception is
>> incorrect?
>> >
>> > Using maintenance metadata to alter the sorting order in the allocator
>> is an interesting idea; currently, the allocator does not have access to
>> information about maintenance, but it's conceivable that we could extend
>> the allocator interface to accommodate this. While the currently-proposed
>> design would not allow this, it would allow operators to deactivate nodes,
>> which is an extreme version of this, since deactivated agents would never
>> have their resources offered to frameworks. This provides a blunt mechanism
>> to prevent scheduling on nodes which have upcoming maintenance, although it
>> sounds like you see some benefit to a more subtle notion of scheduling
>> priority based on upcoming maintenance? Do you think that maintenance-aware
>> sorting would provide much more benefit to you over agent deactivation? Do
>> you make use of the existing maintenance primitives to signal upcoming
>> maintenance on agents?
>> >
>> > Thanks!
>> > Greg
>> >
>> > On Thu, Jun 6, 2019 at 9:37 AM Maxime Brugidou <
>> maxime.brugidou@gmail.com> wrote:
>> > Hi,
>> >
>> > As a Mesos operator, I am really surprised by this proposal.
>> >
>> > The main advantage of the proposed design is that we can finally set
>> nodes down for maintenance with a configurable kill grace period and a
>> proper task status (with maintenance primitives, it was TASK_LOST I think)
>> without any specific cooperation from the frameworks.
>> >
>> > I think that this could be just an evolution of the current primitives.
>> >
>> > With the new proposal, it's going to be as difficult as before to have
>> SLA-aware maintenances because it will need cooperation from the frameworks
>> anyway and we know this is rarely a priority for them. We will also lose
>> the ability to signal future maintenance in order to optimize allocations.
>> >
>> > For example I had this idea to improve the allocator (or write a custom
>> one) that would offer resources from agents with no maintenance planned in
>> priority, and then sort agents by maintenance date in decreasing order.
>> This would be a big improvement to prevent cluster reboots from triggering
>> too many task restarts. This will not be possible with the new primitives.
>> The same idea applies for frameworks too.
>> >
>> > Maxime
>> >
>> > Le jeu. 30 mai 2019 à 22:16, Joseph Wu <jo...@mesosphere.io> a écrit :
>> > As far as I can tell, the document is public.
>> >
>> > On Thu, May 30, 2019 at 12:22 AM Marc Roos <M....@f1-outsourcing.eu>
>> wrote:
>> >
>> > Is the doc not public?
>> >
>> >
>> > -----Original Message-----
>> > From: Joseph Wu [mailto:joseph@mesosphere.io]
>> > Sent: donderdag 30 mei 2019 2:07
>> > To: dev; user
>> > Subject: Design doc: Agent draining and deprecation of maintenance
>> > primitives
>> >
>> > Hi all,
>> >
>> > A few years back, we added some constructs called maintenance
>> primitives
>> > to Mesos.  This feature was meant to allow operators and frameworks to
>> > cooperate in draining tasks off nodes scheduled for maintenance.  As
>> far
>> > as we've observed since, this feature never achieved enough adoption to
>> > be useful for operators.
>> >
>> > As such, we are proposing a more opinionated approach for draining
>> > tasks.  The goal is to have Mesos perform draining in lieu of
>> > frameworks, minimizing or eliminating the need to change frameworks to
>> > account for draining.  We will also be simplifying the operator
>> > workflow, which would only require a single call (holding an AgentID)
>> to
>> > start draining; and a single call to bring an agent back into the
>> > cluster.
>> >
>> >
>> > Due to how closely this proposed feature overlaps with maintenance
>> > primitives, we will be deprecating maintenance primitives upon
>> > implementation of agent draining.
>> >
>> >
>> > If interested, please take a look at the design document:
>> >
>> >
>> https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
>> >
>> >
>>
>>

Re: Design doc: Agent draining and deprecation of maintenance primitives

Posted by Greg Mann <gr...@mesosphere.io>.
Christoph,
Great to hear that you're using the maintenance primitives! It seems unwise
for us to deprecate this part of the API given the fact that you and Maxime
have both expressed a desire for it to stick around. I'll adjust the agent
draining design doc to remove the deprecation of that feature. Many thanks
for your feedback.

Greg

Re: Design doc: Agent draining and deprecation of maintenance primitives

Posted by Greg Mann <gr...@mesosphere.io>.
Christoph,
Great to hear that you're using the maintenance primitives! It seems unwise
for us to deprecate this part of the API given the fact that you and Maxime
have both expressed a desire for it to stick around. I'll adjust the agent
draining design doc to remove the deprecation of that feature. Many thanks
for your feedback.

Greg

On Fri, Jun 7, 2019 at 9:24 PM Heer, Christoph <ch...@sap.com>
wrote:

> Hi everyone,
>
> my team and I implemented our own Mesos framework for task execution on
> our bare-metal on-prem cluster.
> Especially for task processing workload with known or estimated task
> duration, the available Mesos maintenance primitives are super powerful for
> scheduler and operators. While developing the scheduler, I hadn't the
> feeling it would be complex to support/respect maintenance windows. Already
> the small logic "Should I launch task X with estimated runtime 3h on node Y
> with scheduled maintenance in 40min?" saved us tons of aborted tasks. Our
> hardware operations team also really likes the way to plan and express
> maintenance windows upfront. Days before the actually maintenance they can
> add the information and the node will be ready at that point in time. Also,
> they can reboot the machines without the fear that any production workload
> will be scheduled until they confirmed the end of the maintenance. But
> looks like this would be also ensured by the new design.
>
> In the past we already used another job orchestration system with a
> draining approach similar to the design proposal. In nearly all cases the
> operations team didn't manage to start the draining mode at the right time.
> Either it was too early, and we didn't use available hardware resources or
> it was too late and it unnecessarily interrupted productive workload.
> Especially for long-running tasks which are expensive at restarting, it
> wasn't a good way to mange scheduled down times.
>
> I don't know the implementation within Mesos and therefore I can't judge
> about the complexity but I think the main problem is that Mesos doesn't
> provide an intuitive interface for managing maintenance windows. The HTTP
> API isn't that complicated but you definitely need own or external tooling.
> Probably most people are already deterred from the JSON syntax with
> nanoseconds. Also, the lack of synchronisation of modifications can be a
> problem and makes it harder to implement tooling around the API. A new more
> fine-grain HTTP API would be a big improvement and would allow to implement
> a nice looking interface within the Mesos UI.
>
> It would be sad to see this great feature disappearing.
>
> Best regards,
> Christoph
>
>
> Christoph Heer
> SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany
>
> Mandatory Disclosure Statement: www.sap.com/impressum
> This e-mail may contain trade secrets or privileged, undisclosed, or
> otherwise
> confidential information. If you have received this e-mail in error, you
> are hereby
> notified that any review, copying, or distribution of it is strictly
> prohibited. Please inform
> us immediately and destroy the original transmittal. Thank you for your
> cooperation.
>
>
> > On 7. Jun 2019, at 09:56, Maxime Brugidou <ma...@gmail.com>
> wrote:
> >
> > I think that you are both correct about the fact that most users don't
> and won't use the schedules to plan maintenance in advance. The main reason
> is that frameworks just don't use this schedule and don't take inverse
> offers into account, but also most use cases are Ok with simply draining
> nodes one after the other without more logic.
> >
> > In the end Benjamin is right we are always hitting the same problem with
> Mesos, there is no good reference implementation of a scheduler, with all
> the features baked in. On our side we are mostly using open source
> schedulers (Marathon, Aurora, Flink...etc) for various use cases and they
> mostly don't leverage maintenance primitives. We started to use them for
> one custom use case where we are indeed building our own framework, and we
> want to provide some sort of "task duration" SLA which would clearly
> benefit from maintenance schedules. Honestly if we are the only users doing
> that, we can maintain the schedules on a separate service easily. I haven't
> seen any framework actually using the inverse offers though.
> >
> > I also agree that adding offer ordering in the allocator is probably not
> the best design since what we want in the end is probably some
> affinity/anti-affinity at the scheduler level based on the "time to reboot"
> for example. But again, this need cooperation from frameworks. My idea was
> more of a hack/prototype idea since I see that slaves are randomly sorted
> in the allocator and we could easily patch it to have a custom sort
> mechanism. But I completely agree that optimistic offers or similar
> techniques are the way to go.
> >
> > I don't think that we will ever get to the point of having a reference
> scheduler, the Mesos community would need to agree on one implementation
> and make sure that every new feature of Mesos gets implemented in the
> scheduler. This is a huge amount of work and coordination/design. The
> mesosphere dcos-commons library is one example of the complexity of such a
> project, it is dedicated to stateful services, is clearly coupled with
> DC/OS (although we are able to use it on bare Mesos too), and it's still
> difficult to use. However, having an open source scheduler exposing a
> higher-level friendly API via RPC (like kubernetes for example), is
> probably the only way to make Mesos more accessible for most users.
> >
> > On Fri, Jun 7, 2019 at 6:24 AM Benjamin Mahler <bm...@apache.org>
> wrote:
> > > With the new proposal, it's going to be as difficult as before to have
> SLA-aware maintenances because it will need cooperation from the frameworks
> anyway and we know this is rarely a priority for them. We will also lose
> the ability to signal future maintenance in order to optimize allocations.
> >
> > Personally, I think right now we should solve the basic need of draining
> a node. The plan to add SLA-awareness into draining was to introduce a
> capability that schedulers opt into that enables them to (1) take control
> over the killing of tasks when an agent is put into the draining state and
> (2) still get offers when an agent is the draining state in case the
> scheduler needs to restart a task that *must* run. This allows an SLA-aware
> scheduler to avoid killing during a drain if its task(s) will have SLAs
> violated.
> >
> > Perhaps this functionality can live alongside the maintenance schedule
> information we currently support, without being coupled together. As far as
> I'm aware that's something we hadn't considered (we considered integrating
> into the maintenance schedules or replacing them).
> >
> > > For example I had this idea to improve the allocator (or write a
> custom one) that would offer resources from agents with no maintenance
> planned in priority, and then sort agents by maintenance date in
> decremasing order.
> >
> > Right now there is no meaning to the order of offers. Adding some
> meaning to the ordering of offers quickly becomes an issue for us as soon
> as there are multiple criteria that need to be evaluated. For example, if
> you want to incorporate maintenance, load spreading, fault domain
> spreading, etc across machines, it becomes less clear how offers should be
> ordered. One could try to build some scoring model in mesos for ordering,
> but it will be woefully inadequate since Mesos does not know anything about
> the pending workloads: it's ultimately the schedulers that are best
> positioned to make these decisions. This is why we are going to move
> towards an "optimistic concurrency" model where schedulers can choose what
> they want and Mesos enforces constraints (e.g. quota limits), thereby
> eliminating the multi-scheduler scalability issues of the current offer
> model.
> >
> > And as somewhat of an aside, the lack of built-in scheduling has been
> bad for the Mesos ecosystem. The vast majority of users just need to
> schedule: services, jobs and cron jobs. These have a pretty standard look
> and feel (including the SLA aspect of them!). Many of the existing
> schedulers could be thinner "orchestrators" that know when to submit
> something to be scheduled by a common scheduler, rather than reimplementing
> all of the typical scheduling primitives (constraints, SLA awareness,
> dealing with the low-level Mesos scheduling API). My point here is that we
> ask too much of frameworks and it hurts users. I would love to see
> scheduling become more standardized and built into Mesos.
> >
> > On Thu, Jun 6, 2019 at 10:52 AM Greg Mann <gr...@mesosphere.io> wrote:
> > Maxime,
> > Thanks for the feedback, it's much appreciated. I agree that it would be
> possible to evolve the existing primitives to accomplish something similar
> to the proposal. That is one option that was considered before writing the
> design doc, but after some discussion, I thought it seemed more
> appropriate to start over with a simpler model that accomplishes what we
> perceive to be the predominant use case: the automated draining of agent
> nodes, without the concept of a maintenance window or designated
> maintenance time in the future. However, perhaps this perception is
> incorrect?
> >
> > Using maintenance metadata to alter the sorting order in the allocator
> is an interesting idea; currently, the allocator does not have access to
> information about maintenance, but it's conceivable that we could extend
> the allocator interface to accommodate this. While the currently-proposed
> design would not allow this, it would allow operators to deactivate nodes,
> which is an extreme version of this, since deactivated agents would never
> have their resources offered to frameworks. This provides a blunt mechanism
> to prevent scheduling on nodes which have upcoming maintenance, although it
> sounds like you see some benefit to a more subtle notion of scheduling
> priority based on upcoming maintenance? Do you think that maintenance-aware
> sorting would provide much more benefit to you over agent deactivation? Do
> you make use of the existing maintenance primitives to signal upcoming
> maintenance on agents?
> >
> > Thanks!
> > Greg
> >
> > On Thu, Jun 6, 2019 at 9:37 AM Maxime Brugidou <
> maxime.brugidou@gmail.com> wrote:
> > Hi,
> >
> > As a Mesos operator, I am really surprised by this proposal.
> >
> > The main advantage of the proposed design is that we can finally set
> nodes down for maintenance with a configurable kill grace period and a
> proper task status (with maintenance primitives, it was TASK_LOST I think)
> without any specific cooperation from the frameworks.
> >
> > I think that this could be just an evolution of the current primitives.
> >
> > With the new proposal, it's going to be as difficult as before to have
> SLA-aware maintenances because it will need cooperation from the frameworks
> anyway and we know this is rarely a priority for them. We will also lose
> the ability to signal future maintenance in order to optimize allocations.
> >
> > For example I had this idea to improve the allocator (or write a custom
> one) that would offer resources from agents with no maintenance planned in
> priority, and then sort agents by maintenance date in decreasing order.
> This would be a big improvement, preventing cluster reboots from
> triggering too many task restarts. This will not be possible with the new
> primitives. The same idea applies to frameworks too.
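[Editor's note: a sketch of the allocator-sorting idea above, with invented field names. Agents with no planned maintenance are offered first; agents with planned maintenance are ordered so the most distant window comes first.]

```python
def maintenance_sort_key(agent):
    """Sort key: no-maintenance agents first, then agents whose
    maintenance window is furthest away."""
    start = agent.get("maintenance_start")  # Unix timestamp or None.
    if start is None:
        return (0, 0)       # No maintenance planned: highest priority.
    return (1, -start)      # Later maintenance sorts before earlier.

def order_agents_for_offers(agents):
    """Order agents for offer generation by maintenance proximity."""
    return sorted(agents, key=maintenance_sort_key)
```

As the thread notes, any single fixed ordering like this bakes one criterion into the allocator; combining it with load or fault-domain spreading would require a scoring model instead.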
> >
> > Maxime
> >
> > On Thu, May 30, 2019 at 10:16 PM Joseph Wu <jo...@mesosphere.io> wrote:
> > As far as I can tell, the document is public.
> >
> > On Thu, May 30, 2019 at 12:22 AM Marc Roos <M....@f1-outsourcing.eu>
> wrote:
> >
> > Is the doc not public?
> >
> >
> > -----Original Message-----
> > From: Joseph Wu [mailto:joseph@mesosphere.io]
> > Sent: Thursday, May 30, 2019 2:07
> > To: dev; user
> > Subject: Design doc: Agent draining and deprecation of maintenance
> > primitives
> >
> > Hi all,
> >
> > A few years back, we added some constructs called maintenance primitives
> > to Mesos.  This feature was meant to allow operators and frameworks to
> > cooperate in draining tasks off nodes scheduled for maintenance.  As far
> > as we've observed since, this feature never achieved enough adoption to
> > be useful for operators.
> >
> > As such, we are proposing a more opinionated approach for draining
> > tasks.  The goal is to have Mesos perform draining in lieu of
> > frameworks, minimizing or eliminating the need to change frameworks to
> > account for draining.  We will also be simplifying the operator
> > workflow, which would only require a single call (holding an AgentID) to
> > start draining, and a single call to bring an agent back into the
> > cluster.
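[Editor's note: a hedged sketch of what this single-call workflow might look like against the master's v1 operator API. The call names DRAIN_AGENT and REACTIVATE_AGENT and their field layout are taken from the proposal and did not exist in Mesos at the time of this thread; treat them as assumptions.]

```python
import json

def drain_agent_request(agent_id, max_grace_period_seconds=None):
    """Build the JSON body for the proposed DRAIN_AGENT operator call."""
    call = {
        "type": "DRAIN_AGENT",
        "drain_agent": {"agent_id": {"value": agent_id}},
    }
    if max_grace_period_seconds is not None:
        # Durations in the operator API are expressed in nanoseconds.
        call["drain_agent"]["max_grace_period"] = {
            "nanoseconds": int(max_grace_period_seconds * 1e9)
        }
    return json.dumps(call)

def reactivate_agent_request(agent_id):
    """Build the JSON body for the proposed call that brings the agent back."""
    return json.dumps({
        "type": "REACTIVATE_AGENT",
        "reactivate_agent": {"agent_id": {"value": agent_id}},
    })
```

These bodies would be POSTed to the master's /api/v1 endpoint with a Content-Type of application/json, following the existing operator API conventions.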
> >
> >
> > Due to how closely this proposed feature overlaps with maintenance
> > primitives, we will be deprecating maintenance primitives upon
> > implementation of agent draining.
> >
> >
> > If interested, please take a look at the design document:
> >
> >
> https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
> >