Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2015/04/06 21:25:12 UTC
[jira] [Updated] (MESOS-304) Master should register a slave only after it confirms it can talk to the slave
[ https://issues.apache.org/jira/browse/MESOS-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kone updated MESOS-304:
-----------------------------
Issue Type: Epic (was: Improvement)
> Master should register a slave only after it confirms it can talk to the slave
> ------------------------------------------------------------------------------
>
> Key: MESOS-304
> URL: https://issues.apache.org/jira/browse/MESOS-304
> Project: Mesos
> Issue Type: Epic
> Reporter: Vinod Kone
>
> We have seen this issue from users running on EC2 and also at Twitter.
> The crux of the issue is that the master starts offering the resources of a slave as soon as it gets a Register message. If for some reason the master --> slave connection is not viable (e.g., the slave advertised its private IP address, or DNS resolution fails), we end up in a loop as follows:
> --> Slave sends Register message to master
> --> Master accepts it and offers resources to the framework
> --> The master's health checks to the slave keep failing
> --> Framework launches tasks on this slave, which would be dropped on the floor
> --> After health check timeout (>60s), master disconnects the slave
> --> Slave sends a Register message again.
> --> Repeat
> One way to solve this problem is to do a 3-way handshake for registration.
> This should also be done for framework registration.
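
The 3-way handshake proposed above can be sketched roughly as follows. This is a minimal illustration, not Mesos code: the Master class, the probe callable, and the addresses are all hypothetical, and the slave's acknowledgement is modelled simply as the probe's return value. The point is only that a slave becomes offerable after the master has confirmed it can reach the slave's advertised address, not on receipt of the Register message alone.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class Master:
    # Hypothetical transport hook: returns True if the slave at the
    # given address answered the master's probe, False otherwise.
    probe: Callable[[str], bool]
    pending: Set[str] = field(default_factory=set)
    registered: Set[str] = field(default_factory=set)

    def on_register(self, slave_id: str, address: str) -> bool:
        # Step 1: slave --> master Register. No resources are offered yet.
        self.pending.add(slave_id)
        # Step 2: master --> slave probe on the advertised address.
        # Step 3: slave --> master acknowledgement (the probe's result).
        reachable = self.probe(address)
        self.pending.discard(slave_id)
        if reachable:
            self.registered.add(slave_id)
        return reachable

    def offerable_slaves(self) -> Set[str]:
        # Only slaves that completed the handshake are offered to frameworks.
        return set(self.registered)


# Usage: one slave advertises a reachable address, the other does not.
reachable_addresses = {"10.0.0.5:5051"}
master = Master(probe=lambda address: address in reachable_addresses)
master.on_register("s1", "10.0.0.5:5051")
master.on_register("s2", "172.16.0.9:5051")
print(sorted(master.offerable_slaves()))
```

With this ordering, a slave whose advertised address is unreachable never enters the offer pool, so frameworks cannot launch tasks on it and the register/disconnect loop described above does not start. The same shape would apply to framework registration.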