Posted to user@mesos.apache.org by Li Jin <ic...@gmail.com> on 2013/09/05 23:54:57 UTC

Messaging reliability in Mesos

Hi Mesosers,

I am wondering how reliable messaging is in Mesos. I didn't find any
documentation about it. For instance, are schedulers guaranteed to
receive task status updates no matter what? Even when a TASK_FINISHED
message is sent while the master is failing over? There are probably too
many failure cases, so I just want to get a general idea.

Thanks,
Li

Re: Messaging reliability in Mesos

Posted by Benjamin Mahler <be...@gmail.com>.

I created https://issues.apache.org/jira/browse/MESOS-681 so that we can
document this for framework developers. In 0.15.0, lost updates will be
fairly rare, so we can consider documenting any known cases where they can
still occur.

On Thu, Sep 5, 2013 at 3:20 PM, Vinod Kone <vi...@gmail.com> wrote:

> tl;dr: If the master fails over at the same time as a slave fails, there
> is a (small) chance that status updates from that slave are not delivered
> to the scheduler.
>
> In earlier versions of Mesos (pre-0.14.0), if the master failed over at
> the same time as a slave failed, that slave's pending status updates were
> not sent to the scheduler.
>
> In 0.14.0, we are introducing a new feature called "Slave Recovery", in
> which slaves checkpoint status update information to disk. This increases
> the reliability of status updates, but there are still some cases where
> updates are not reliably retried (e.g., the master fails over when the
> slave fails and the slave never comes back up).
>
> In 0.15.0, we plan to introduce another feature called "Registrar" where
> masters checkpoint slave info to durable storage. This reduces the
> probability of lost updates even further. Unfortunately, even this wouldn't
> give a 100% reliability guarantee on the delivery of status updates.
>

In this case, we will be able to send slaveLost, which is strictly better
than sending nothing. Schedulers could act on this signal in 0.15.0. The
caveat is that we will have to consider making the delivery of slaveLost
reliable; otherwise, if the scheduler is failing over, the message will
be dropped.
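
As a rough illustration of how a scheduler could act on that signal, here
is a minimal Python sketch. It does not use the Mesos bindings; the
TaskTracker class and its method names are hypothetical and only loosely
mirror the statusUpdate and slaveLost callbacks. The idea is that the
framework remembers which slave each task was launched on, so that a lost
slave can be translated into a set of tasks to treat as lost.

```python
# Task states after which no further updates are expected.
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class TaskTracker:
    """Hypothetical framework-side bookkeeping; not part of the Mesos API."""

    def __init__(self):
        self.tasks = {}  # task_id -> (slave_id, last known state)

    def launched(self, task_id, slave_id):
        self.tasks[task_id] = (slave_id, "TASK_STAGING")

    def status_update(self, task_id, state):
        slave_id, _ = self.tasks[task_id]
        self.tasks[task_id] = (slave_id, state)

    def slave_lost(self, slave_id):
        """Mark every non-terminal task on the slave as lost and return them."""
        lost = []
        for task_id, (sid, state) in sorted(self.tasks.items()):
            if sid == slave_id and state not in TERMINAL:
                self.tasks[task_id] = (sid, "TASK_LOST")
                lost.append(task_id)
        return lost


tracker = TaskTracker()
tracker.launched("a", "s1")
tracker.launched("b", "s1")
tracker.launched("c", "s2")
tracker.status_update("a", "TASK_FINISHED")
to_reschedule = tracker.slave_lost("s1")  # only "b" was still in flight
```

Tasks that already reached a terminal state are left untouched, so a late
slaveLost does not overwrite a TASK_FINISHED that was delivered earlier.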



Re: Messaging reliability in Mesos

Posted by Vinod Kone <vi...@gmail.com>.

tl;dr: If the master fails over at the same time as a slave fails, there
is a (small) chance that status updates from that slave are not delivered
to the scheduler.

In earlier versions of Mesos (pre-0.14.0), if the master failed over at
the same time as a slave failed, that slave's pending status updates were
not sent to the scheduler.

In 0.14.0, we are introducing a new feature called "Slave Recovery", in
which slaves checkpoint status update information to disk. This increases
the reliability of status updates, but there are still some cases where
updates are not reliably retried (e.g., the master fails over when the
slave fails and the slave never comes back up).
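
To make the checkpointing idea concrete, here is a minimal Python sketch.
It is not the actual Slave Recovery implementation: the UpdateLog class,
the JSON-lines file format, and the explicit acknowledge step are all
assumptions made for illustration. Updates are written to disk before
being sent, and anything without a matching acknowledgment is found again
after a restart so it can be retried.

```python
import json
import os
import tempfile


class UpdateLog:
    """Hypothetical append-only log of status updates and acknowledgments."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the record reaches the disk

    def checkpoint(self, task_id, state):
        self._append({"type": "update", "task_id": task_id, "state": state})

    def acknowledge(self, task_id, state):
        self._append({"type": "ack", "task_id": task_id, "state": state})

    def pending(self):
        """Replay the log and return updates that were never acknowledged."""
        if not os.path.exists(self.path):
            return []
        unacked = []
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                key = (rec["task_id"], rec["state"])
                if rec["type"] == "update":
                    unacked.append(key)
                elif key in unacked:
                    unacked.remove(key)
        return unacked


def demo():
    path = os.path.join(tempfile.mkdtemp(), "updates.log")
    log = UpdateLog(path)
    log.checkpoint("task-1", "TASK_RUNNING")
    log.acknowledge("task-1", "TASK_RUNNING")
    log.checkpoint("task-1", "TASK_FINISHED")  # ack never arrives
    recovered = UpdateLog(path)  # simulates the slave restarting
    return recovered.pending()   # updates that should be resent
```

The sketch also shows the gap described above: if the slave never comes
back up, nothing replays the log, so the unacknowledged update is never
retried.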

In 0.15.0, we plan to introduce another feature called "Registrar" where
masters checkpoint slave info to durable storage. This reduces the
probability of lost updates even further. Unfortunately, even this wouldn't
give a 100% reliability guarantee on the delivery of status updates.
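
A minimal Python sketch of why checkpointing slave info helps, under
stated assumptions: the Registry class, its JSON file, and the
lost_after_failover helper are hypothetical and are not the planned
Registrar design. The point is only that a newly elected master that can
read back the set of known slaves is able to tell which ones failed to
come back.

```python
import json
import os
import tempfile


class Registry:
    """Hypothetical durable record of the slaves a master has admitted."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return set()
        with open(self.path, encoding="utf-8") as f:
            return set(json.load(f))

    def admit(self, slave_id):
        slaves = self.load()
        slaves.add(slave_id)
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(sorted(slaves), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # atomically swap in the new contents


def lost_after_failover(registry, reregistered):
    """Slaves known before the failover that did not re-register."""
    return sorted(registry.load() - set(reregistered))


path = os.path.join(tempfile.mkdtemp(), "registry.json")
old_master = Registry(path)
old_master.admit("s1")
old_master.admit("s2")

new_master = Registry(path)  # a different master reading the same storage
missing = lost_after_failover(new_master, reregistered=["s1"])
```

Without the stored set, the new master would have no record that "s2"
ever existed and could not report anything about it to the scheduler.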
