
Posted to user@mesos.apache.org by Marcus Larsson <ma...@oracle.com> on 2015/10/09 12:48:38 UTC

Framework control over slave recovery

Hi,

I'm part of a project investigating the use of Mesos for a distributed 
build and test system. For some of our tasks we would like to have more 
control over the slave recovery policy. Currently, when a slave fails 
its health check, it seems Mesos will always mark any task on the slave 
as lost, and shut down the slave when (or if) it reconnects. We would 
like the framework to have more information and control over this.

I found an issue [1] in JIRA that mentions implementing something like 
this, but it seems only the part with the slave removal rate limiter was 
implemented. What I'm wondering is if there is any support in Mesos for 
letting the framework decide how to handle slave removal/recovery?

For our case, we would like the framework to be notified when a slave 
fails its health check, so that the appropriate action for the task 
running on that slave can be taken. Some of our tasks will be very 
long-running, and we don't want to restart a few days' worth of work 
because the network was down for a while.

Thanks,
Marcus

[1]: https://issues.apache.org/jira/browse/MESOS-2246

Re: Framework control over slave recovery

Posted by Marco Massenzio <ma...@mesosphere.io>.

It sounds to me like a reasonable expectation that the framework be notified
if the agent(s) running one or more of its tasks start showing signs of
unhealthiness. In most instances, we would expect frameworks to happily
ignore such a situation and just let Mesos take care of the matter, but if
they do care, they should be able to know.

I'm not so sure about the feasibility of a 'per-task timeout', but the
notification would probably not be too complicated (although it does open
up a whole new area of debate around the implementation and how to modify
the API to enable it).

Could you please file a Jira requesting this as a feature on the Master?

Thanks!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Oct 9, 2015 at 3:29 PM, Marcus Larsson <ma...@oracle.com>
wrote:

> Hi,
>
> On 2015-10-09 15:26, Marco Massenzio wrote:
>
>> The 'marking' of the task is not immediate: the Master actually waits a
>> beat or two to see if the Agent reconnects, and there are various flags
>> that control the behavior around this [0].
>>
>> Naive question: I am assuming that you have already looked into a
>> combination of:
>>
>> --max_slave_ping_timeouts=VALUE
>> --slave_ping_timeout=VALUE
>> --slave_removal_rate_limit=VALUE
>> --slave_reregister_timeout=VALUE
>>
>> that may help with your use case?
>> I'm not really an expert on these flags, so I'm not entirely sure whether
>> a combination of them would work for your scenario.
>
> Yeah, I've seen and tried using these flags. While they can be used to
> prevent Mesos from killing the agents too quickly, the framework will not
> be notified about the slave failing its health checks unless it times out
> completely and the task is lost. Also, ideally we would want per-task
> timeouts, whereas these settings are global.
>
> Thanks,
> Marcus
>
>> [0] http://mesos.apache.org/documentation/latest/configuration/
>>
>> *Marco Massenzio*
>> *Distributed Systems Engineer*
>> http://codetrips.com

Re: Framework control over slave recovery

Posted by Marcus Larsson <ma...@oracle.com>.

Hi,

On 2015-10-09 15:26, Marco Massenzio wrote:
> The 'marking' of the task is not immediate: the Master actually waits a
> beat or two to see if the Agent reconnects, and there are various flags
> that control the behavior around this [0].
>
> Naive question: I am assuming that you have already looked into a
> combination of:
>
> --max_slave_ping_timeouts=VALUE
> --slave_ping_timeout=VALUE
> --slave_removal_rate_limit=VALUE
> --slave_reregister_timeout=VALUE
>
> that may help with your use case?
> I'm not really an expert on these flags, so I'm not entirely sure
> whether a combination of them would work for your scenario.

Yeah, I've seen and tried using these flags. While they can be used to 
prevent Mesos from killing the agents too quickly, the framework will 
not be notified about the slave failing its health checks unless it 
times out completely and the task is lost. Also, ideally we would want 
per-task timeouts, whereas these settings are global.
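
To make the per-task idea concrete, here is a minimal, purely hypothetical
sketch of such a policy on the framework side. Nothing in it is a Mesos API:
the task kinds, the `GRACE_PERIODS` table, the `decide` function, and the
assumption that the framework learns how long an agent has been unreachable
are all invented for illustration.

```python
from datetime import timedelta
from enum import Enum


class Action(Enum):
    WAIT = "wait"              # keep the task, hope the agent comes back
    RESCHEDULE = "reschedule"  # give up and run the task elsewhere


# Hypothetical per-task-kind grace periods (not a Mesos setting).
GRACE_PERIODS = {
    "unit-test": timedelta(minutes=2),
    "nightly-build": timedelta(hours=6),
}
DEFAULT_GRACE = timedelta(minutes=5)


def decide(task_kind: str, unreachable_for: timedelta) -> Action:
    """Pick an action for a task whose agent has been unreachable."""
    grace = GRACE_PERIODS.get(task_kind, DEFAULT_GRACE)
    if unreachable_for <= grace:
        return Action.WAIT
    return Action.RESCHEDULE


# Same 10-minute outage, different outcome per task kind.
outage = timedelta(minutes=10)
print(decide("unit-test", outage).value)      # reschedule
print(decide("nightly-build", outage).value)  # wait
```

With something along these lines, a short test job would be rescheduled
after a couple of minutes, while a multi-day build would ride out a brief
network outage.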

Thanks,
Marcus

>
> [0] http://mesos.apache.org/documentation/latest/configuration/
>
> /Marco Massenzio/
> /Distributed Systems Engineer/
> http://codetrips.com

Re: Framework control over slave recovery

Posted by Marco Massenzio <ma...@mesosphere.io>.

The 'marking' of the task is not immediate: the Master actually waits a beat
or two to see if the Agent reconnects, and there are various flags that
control the behavior around this [0].

Naive question: I am assuming that you have already looked into a
combination of:

--max_slave_ping_timeouts=VALUE
--slave_ping_timeout=VALUE
--slave_removal_rate_limit=VALUE
--slave_reregister_timeout=VALUE

that may help with your use case?
I'm not really an expert on these flags, so I'm not entirely sure whether a
combination of them would work for your scenario.
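
As a rough sketch of how the two ping flags interact: assuming the master
gives up on an agent once `max_slave_ping_timeouts` consecutive pings have
each gone unanswered for `slave_ping_timeout`, the tolerated outage is
roughly their product. The function name and the values below are
illustrative only, not recommendations, and the other two flags are not
modeled.

```python
from datetime import timedelta


def removal_window(slave_ping_timeout: timedelta,
                   max_slave_ping_timeouts: int) -> timedelta:
    """Approximate how long an unresponsive agent is tolerated.

    Assumes removal happens after `max_slave_ping_timeouts` consecutive
    pings each time out after `slave_ping_timeout`.
    """
    if max_slave_ping_timeouts < 1:
        raise ValueError("max_slave_ping_timeouts must be at least 1")
    return slave_ping_timeout * max_slave_ping_timeouts


# Illustrative settings: a tight window and a much more lenient one.
print(removal_window(timedelta(seconds=15), 5))  # 0:01:15
print(removal_window(timedelta(minutes=1), 30))  # 0:30:00
```

Note that whatever window is chosen this way applies to every agent and
task in the cluster.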

[0] http://mesos.apache.org/documentation/latest/configuration/




*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com
