Posted to dev@mesos.apache.org by Evers Benno <be...@yandex-team.ru> on 2016/07/13 12:44:40 UTC

Registering and framework failover

Hi all,

imagine the following situation: I am a framework with a failover timeout
of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
re-register with the master.

If my registration attempt arrives at the master within the time limit,
everything is fine and I even get back the old tasks for reconciliation.
But if it arrives slightly later, the framework id is permanently blocked
by Mesos and I am not able to register. Instead, I receive an error()
callback with the message "Framework has been removed".

Is there any way to reliably connect to the master while also
reconciling old tasks if possible?

I looked at how other frameworks solve this, and it seems that
Kafka doesn't handle this at all
(https://dcosjira.atlassian.net/browse/KAFKA-4), while Marathon scans the
error message for the string "Framework has been removed" and changes
the framework id in that case.
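
For concreteness, that string check boils down to something like the
following sketch (illustrative names only, not Marathon's actual code):

```python
# The substring Marathon reportedly matches on, per the discussion above.
FRAMEWORK_REMOVED_MSG = "Framework has been removed"

def should_reset_framework_id(error_message: str) -> bool:
    """Return True if the stored framework id should be discarded
    and registration retried with an empty id."""
    return FRAMEWORK_REMOVED_MSG in error_message
```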

If the latter is the intended solution, are these strings considered
part of the mesos API? Is it guaranteed they will not be changed after
the 1.0 release?

Best regards,
Benno

Re: Registering and framework failover

Posted by Vinod Kone <vi...@apache.org>.
As Neil mentioned, we plan to add error codes for asynchronous errors
(error() callback in the old API, Error event in the new API) and
synchronous errors (HTTP 4xx/5xx responses in the new API).

Having said that, I would advise against adding logic in your framework to
do something smart (like removing its framework id from a persistent store)
when it gets an error saying the framework has been removed. I would
rather have the framework exit/crash and require operator involvement to
rectify the situation. If you end up here, something badly at odds with
your expectations is happening in your cluster, and it should be a very
rare event. Having a human involved in rare catastrophic events is
probably better.
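
As a sketch, that fail-fast policy could look like this (scheduler/driver
wiring omitted; the names here are illustrative, not a real Mesos binding):

```python
import sys

def on_error(message: str, exit_fn=sys.exit) -> None:
    """error() callback: log the failure and stop the framework, so an
    operator has to inspect the cluster before it registers again."""
    print("FATAL: error from master: %s" % message, file=sys.stderr)
    exit_fn(1)
```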

On Thu, Jul 14, 2016 at 3:32 AM, Evers Benno <be...@yandex-team.ru> wrote:

> So, given that this probably won't be changed before the 1.0 release,
> are the strings considered part of the stable API? Or is it recommended
> not to rely on `error()` at all? (That's what we did for now, setting
> failover timeout to 5 years)
>
> On 13.07.2016 15:37, Neil Conway wrote:
> > Ah, right -- yes, at the moment you need to look at error strings to
> > decide whether to retry with a new framework ID, unfortunately. IMO we
> > should introduce error codes or enums to make this process more
> > reliable, but no one has done so yet:
> >
> > https://issues.apache.org/jira/browse/MESOS-4548
> > https://issues.apache.org/jira/browse/MESOS-5322
> >
> > Neil
> >
> >
> > On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <be...@yandex-team.ru>
> wrote:
> >> Let me try to clarify:
> >>
> >> The problem is that I don't get to decide manually if the framework
> >> should try to take a new id or re-use the old one, but it needs to be
> >> decided programmatically, by an algorithm.
> >>
> >> Afaik it's not possible to get the time when the framework disconnected
> >> from mesos, so it's not possible to know how much time is left until the
> >> failover timeout runs out. Therefore, if I want to attempt task
> >> reconciliation, I just have to try registering with my old framework id
> >> and see what happens.
> >>
> >> However, in the case where the failover timeout already passed, I now
> >> need to programmatically detect this error and try again with an empty
> >> framework id.
> >>
> >> My question was, is it possible to do this?
> >>
> >> (also, we actually use a failover timeout of 1 week, but it doesn't
> >> really change the problem and I mistakenly assumed that an example with
> >> smaller values would be more intuitive)
> >>
> >> On 13.07.2016 14:50, Neil Conway wrote:
> >>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <be...@yandex-team.ru>
> wrote:
> >>>> imagine the following situation: I am a framework with failover
> timeout
> >>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
> >>>> register with the master again.
> >>>>
> >>>> If my registration attempt arrives at the master within the time limit
> >>>> everything will be fine and I even get back the old tasks for
> >>>> reconciliation, but if it arrives slightly later the framework id is
> >>>> permanently blocked by mesos, and I am not able to register. Instead,
> I
> >>>> will receive an error()-callback with the message "Framework has been
> >>>> removed".
> >>>
> >>> Right: if you set a failover_timeout of 1 hour, your framework is
> >>> expected to reregister within one hour. If it does not, all of its
> >>> tasks will be killed and you need to start over with a new
> >>> FrameworkID. Can you clarify which aspect of this behavior is
> >>> problematic for you?
> >>>
> >>> Note that a failover_timeout of 1 hour is probably a little low.
> >>>
> >>>> Is there any way to reliably connect to the master while also
> >>>> reconciling old tasks if possible?
> >>>
> >>> Sorry, not sure what you mean by this.
> >>>
> >>> Neil
> >>>
>
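
The registration flow discussed in this thread can be sketched as follows;
`register` here is a hypothetical synchronous stand-in for the scheduler
driver (the real API is asynchronous and callback-based):

```python
class FrameworkRemoved(Exception):
    """Raised by the (hypothetical) registration helper when the master
    reports that the failover timeout has already expired."""

def register_with_failover(register, stored_id):
    """Try the stored framework id first to get task reconciliation;
    fall back to an empty id if the framework was already removed."""
    if stored_id is not None:
        try:
            # Within the failover timeout: old tasks come back for
            # reconciliation.
            return register(stored_id)
        except FrameworkRemoved:
            # Timeout expired: the old id is permanently blocked.
            pass
    # Empty id: the master assigns a fresh FrameworkID.
    return register(None)
```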
