Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2014/09/12 02:56:33 UTC

[jira] [Updated] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.

     [ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-1668:
------------------------------
          Sprint: Mesos Q3 Sprint 5
        Assignee: Vinod Kone
    Story Points: 2

The plan is to handle this by piggybacking the current slave state (e.g., bool registered) on the ping/pong messages.

When the slave receives a ping indicating that the master considers it disconnected, but the slave itself is not yet aware of this (the socket broke only on the master side), the slave will attempt to re-register.
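
A rough sketch of that idea follows, in Python for brevity rather than as actual Mesos code. Every name here (Ping, connected, Master, Slave, on_ping, on_message, and the "PONG"/"REREGISTER" strings) is hypothetical and only illustrates the flow; the real message fields and handlers may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ping:
    # Hypothetical field: the master's view of whether the slave is connected.
    connected: bool


@dataclass
class Slave:
    registered: bool = True   # the slave's own view of its state
    reregistrations: int = 0  # how many times it has re-registered

    def on_ping(self, ping: Ping) -> List[str]:
        """Always answer the health check; re-register if the views disagree."""
        replies = ["PONG"]
        if self.registered and not ping.connected:
            # The socket broke only on the master side, so the ping is the
            # slave's only signal that it needs to re-register.
            self.reregistrations += 1
            replies.append("REREGISTER")
        return replies


@dataclass
class Master:
    slave_connected: bool = True

    def on_socket_closed(self) -> None:
        # One-way failure: only the master notices the broken link.
        self.slave_connected = False

    def make_ping(self) -> Ping:
        # Piggyback the master's view of the slave on the health-check ping.
        return Ping(connected=self.slave_connected)

    def on_message(self, message: str) -> None:
        if message == "REREGISTER":
            self.slave_connected = True


if __name__ == "__main__":
    master, slave = Master(), Slave()
    master.on_socket_closed()
    for message in slave.on_ping(master.make_ping()):
        master.on_message(message)
    print(master.slave_connected, slave.reregistrations)
```

In this sketch the slave keeps answering pings as usual, so health checking is unaffected; the extra flag only triggers the re-registration that would otherwise never happen.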

> Handle a temporary one-way master --> slave socket closure.
> -----------------------------------------------------------
>
>                 Key: MESOS-1668
>                 URL: https://issues.apache.org/jira/browse/MESOS-1668
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Vinod Kone
>            Priority: Minor
>              Labels: reliability
>
> In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:
> → Master and slave are connected and operating normally.
> → Temporary one-way network failure, master→slave link breaks.
> → Master marks slave as disconnected.
> → Network restored and health checking continues normally, slave is not removed as a result. Slave does not attempt to re-register since it is receiving pings once again.
> → Slave remains disconnected according to the master, and the slave does not try to re-register. Bad!
> We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation.
> Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)