Posted to dev@mesos.apache.org by "Qinghe Jin (JIRA)" <ji...@apache.org> on 2012/11/10 17:23:12 UTC

[jira] [Commented] (MESOS-304) Master should register a slave only after it confirms it can talk to the slave

    [ https://issues.apache.org/jira/browse/MESOS-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494722#comment-13494722 ] 

Qinghe Jin commented on MESOS-304:
----------------------------------

I also ran into a similar problem caused by the lack of a health check. Sometimes, when the slave fails to start, the master rushes to offer its resources to the Hadoop framework without a health check. Hadoop then fails to get the slave info and reports the error to hadoop-*jobtracker*.log again and again. As a result, the log file uses up the disk space.

I wonder whether we need to check the health of all link directions periodically. If not, then whenever we want to do something, we have to figure out which link directions that operation needs and check their health before doing the real work.
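
To make the two options concrete, here is a rough sketch in Python (not Mesos code). The class LinkHealthMonitor, the probe callback and the max_age threshold are all made-up names for illustration. check_all() is the periodic option; is_healthy() is the on-demand option, which reuses a recent result when one exists.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple

# A link direction is (source, destination), e.g. ("master", "slave-1").
Link = Tuple[str, str]


@dataclass
class LinkHealthMonitor:
    """Hypothetical helper: caches the last probe result per link direction."""

    probe: Callable[[str, str], bool]          # assumed: returns True if src can reach dst
    max_age: float = 10.0                      # assumed freshness window, in seconds
    clock: Callable[[], float] = time.monotonic
    _status: Dict[Link, Tuple[bool, float]] = field(default_factory=dict)

    def check_all(self, links: Iterable[Link]) -> None:
        """Periodic option: probe every direction and remember the result."""
        now = self.clock()
        for src, dst in links:
            self._status[(src, dst)] = (self.probe(src, dst), now)

    def is_healthy(self, src: str, dst: str) -> bool:
        """On-demand option: use a fresh cached result, otherwise probe now."""
        now = self.clock()
        cached = self._status.get((src, dst))
        if cached is not None and now - cached[1] <= self.max_age:
            return cached[0]
        ok = self.probe(src, dst)
        self._status[(src, dst)] = (ok, now)
        return ok
```

With something like this, the master would call is_healthy("master", slave_id) before offering that slave's resources, and a timer could call check_all() in the background so that most on-demand lookups hit the cache.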



  
                
> Master should register a slave only after it confirms it can talk to the slave
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-304
>                 URL: https://issues.apache.org/jira/browse/MESOS-304
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Vinod Kone
>
> We have seen this issue from users running on EC2 and also at Twitter.
> The crux of the issue is that the master starts offering the resources of a slave as soon as it gets a Register message. If for some reason the master --> slave connection is not viable (e.g. slave used its private IP address, DNS failures), we end up in a loop as follows:
> --> Slave sends Register message to master
> --> Master accepts it and offers resources to the framework
> --> The master's health checks to the slave keep failing
> --> Framework launches tasks on this slave, which would be dropped on the floor
> --> After health check timeout (>60s), master disconnects the slave
> --> Slave sends a Register message again.
> --> Repeat
> One way to solve this problem is to do a 3-way handshake for registration.
> This should also be done for framework registration.
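
The 3-way handshake proposed in the issue could look roughly like the sketch below. This is Python for illustration only, not the Mesos implementation; the Master class, the send callback and the message name REGISTER_ACK are assumptions. The point is that a slave stays PENDING, and its resources are not offered, until it confirms that it received the master's reply.

```python
from enum import Enum


class SlaveState(Enum):
    PENDING = "pending"        # Register received, master --> slave not yet verified
    REGISTERED = "registered"  # slave confirmed it received the master's ack


class Master:
    """Hypothetical master-side bookkeeping for a 3-way registration."""

    def __init__(self, send):
        # send(slave_id, message) is an assumed transport hook.
        self.send = send
        self.slaves = {}

    def on_register(self, slave_id, resources):
        # Step 1 (slave --> master) has arrived. Do not offer anything yet.
        self.slaves[slave_id] = {"state": SlaveState.PENDING, "resources": resources}
        # Step 2 (master --> slave): exercises the direction that fails in the report.
        self.send(slave_id, "REGISTER_ACK")

    def on_confirm(self, slave_id):
        # Step 3 (slave --> master): proof that the slave received step 2.
        entry = self.slaves.get(slave_id)
        if entry is not None:
            entry["state"] = SlaveState.REGISTERED

    def offerable_resources(self):
        # Only fully registered slaves are visible to frameworks.
        return {
            sid: entry["resources"]
            for sid, entry in self.slaves.items()
            if entry["state"] is SlaveState.REGISTERED
        }
```

If the master --> slave direction is broken, step 2 never arrives, step 3 is never sent, and offerable_resources() keeps excluding that slave, so no tasks are launched on it. A real version would also need a timeout to drop PENDING entries; that is left out here.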

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira