Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2015/05/01 04:10:06 UTC

[jira] [Resolved] (MESOS-2679) Slave asked to shut down by master because 'health check timed out'

     [ https://issues.apache.org/jira/browse/MESOS-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler resolved MESOS-2679.
------------------------------------
    Resolution: Not A Problem

The slave needs to be run under a supervising process that will restart it if it terminates. Mesos does not currently provide such a watchdog process.
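As a rough illustration of what such a watchdog does, here is a minimal Python sketch. It is not part of Mesos, and the function name `supervise` and its parameters are made up for this example. It simply reruns a command every time it exits. In practice an existing init system or process supervisor is likely a better choice than a hand-rolled loop.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, restart_delay=5.0):
    """Run cmd and restart it each time it exits (hypothetical helper).

    cmd           -- argument list for the process to keep alive
    max_restarts  -- stop after this many restarts; None means restart forever
    restart_delay -- seconds to wait before each restart
    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    while True:
        # Blocks until the child process terminates for any reason.
        proc = subprocess.run(cmd)
        exit_codes.append(proc.returncode)
        print(f"process exited with code {proc.returncode}", file=sys.stderr)

        # The first run is not a restart, so N restarts means N + 1 runs.
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes

        # Pause briefly so a crash loop does not spin the CPU.
        time.sleep(restart_delay)
```

For example, assuming the slave is started as `mesos-slave --master=192.168.1.10:5050`, calling `supervise(["mesos-slave", "--master=192.168.1.10:5050"])` would start it again after the master asks it to shut down.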

As for the health check failure: the current timeout is 75 seconds. Once the timeout elapsed and the shutdown message was sent, it took approximately 12 seconds for the message to reach the slave, which suggests there may have been an actual network issue that led to the health check failure. The master will shut down slaves that it cannot communicate with, so this is to be expected.
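The 12 second figure comes from comparing the timestamps in the master and slave logs quoted in the report. A small Python sketch of that arithmetic follows. The helper name `glog_time` is made up for this example, and the comparison assumes the clocks on the two hosts are reasonably in sync.

```python
from datetime import datetime


def glog_time(line):
    """Parse the timestamp at the start of a glog-style log line.

    Lines start with a severity letter followed by MMDD, then a space and
    HH:MM:SS.microseconds. The year is not logged, so strptime's default
    year is used; that is fine for taking a difference within one day.
    """
    stamp = " ".join(line.split()[:2])[1:]  # drop the severity letter
    return datetime.strptime(stamp, "%m%d %H:%M:%S.%f")


# Log lines copied from the report.
master_line = (
    "W0430 15:12:00.616083 19751 master.cpp:3417] Shutting down slave "
    "20150430-141442-1214949568-5050-19747-S2 at slave(1)@192.168.1.15:5051 "
    "(hpblade05) with message 'health check timed out'"
)
slave_line = (
    "I0430 15:12:12.737057 32354 slave.cpp:571] Slave asked to shut down by "
    "master@192.168.1.10:5050 because 'health check timed out'"
)

# Time between the master sending the shutdown and the slave logging it.
delay = (glog_time(slave_line) - glog_time(master_line)).total_seconds()
print(f"shutdown message took about {delay:.2f} seconds to arrive")
```

With the two lines above the difference is roughly 12.12 seconds.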

> Slave asked to shut down by master because 'health check timed out'
> -------------------------------------------------------------------
>
>                 Key: MESOS-2679
>                 URL: https://issues.apache.org/jira/browse/MESOS-2679
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>    Affects Versions: 0.22.1
>            Reporter: Littlestar
>
> I run Spark 1.3.1 on Mesos 0.22.1 rc6 (linux64), and some Mesos slave nodes go offline.
> slave node logs:
> I0430 15:12:12.737057 32354 slave.cpp:571] Slave asked to shut down by master@192.168.1.10:5050 because 'health check timed out'
> master node logs:
> I0430 15:12:00.615777 19759 master.cpp:237] Shutting down slave 20150430-141442-1214949568-5050-19747-S2 due to health check timeout
> W0430 15:12:00.616083 19751 master.cpp:3417] Shutting down slave 20150430-141442-1214949568-5050-19747-S2 at slave(1)@192.168.1.15:5051 (hpblade05) with message 'health check timed out'
> Why does the slave go offline and not restart itself?
> Are there any configuration options to increase this timeout interval?


