Posted to issues@mesos.apache.org by "Littlestar (JIRA)" <ji...@apache.org> on 2015/05/02 17:50:05 UTC

[jira] [Commented] (MESOS-2679) Slave asked to shut down by master because 'health check timed out'

    [ https://issues.apache.org/jira/browse/MESOS-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525327#comment-14525327 ] 

Littlestar commented on MESOS-2679:
-----------------------------------

>>>which seems to indicate there may have been an actual network issue here that led to the health check failure. 
I am sure the network is OK; the nodes are on a blade-internal network.

Is the counter cleared once a health check succeeds? I see the counter maximum is 5 (MAX_SLAVE_PING_TIMEOUTS).
Are there any more logs on the "health check failure"? Thanks.
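
To make the counter question concrete, here is a minimal sketch of a consecutive-timeout counter. Only the limit of 5 (MAX_SLAVE_PING_TIMEOUTS) comes from this thread; the class name, the helper, and above all the reset-on-success behaviour are assumptions for illustration, not confirmed Mesos behaviour.

```python
# Hypothetical model of a consecutive-timeout health check counter.
# Only the limit of 5 is taken from the thread; everything else is assumed.
MAX_SLAVE_PING_TIMEOUTS = 5


class PingObserver:
    """Counts consecutive missed pings (illustrative, not the Mesos class)."""

    def __init__(self, max_timeouts=MAX_SLAVE_PING_TIMEOUTS):
        self.max_timeouts = max_timeouts
        self.timeouts = 0
        self.shut_down = False

    def record(self, pong_received):
        """Record one ping interval; return True once shutdown is triggered."""
        if self.shut_down:
            return True
        if pong_received:
            # Assumption: a single successful health check clears the counter.
            self.timeouts = 0
        else:
            self.timeouts += 1
            if self.timeouts >= self.max_timeouts:
                self.shut_down = True
        return self.shut_down


def survives(ping_results):
    """Return True if the slave is never shut down for the given results."""
    observer = PingObserver()
    for ok in ping_results:
        observer.record(ok)
    return not observer.shut_down
```

Under that assumption, four misses, one success, then four more misses would not trigger a shutdown, while five misses in a row would. If the counter were never cleared, the first sequence would also end in a shutdown, which is why the answer matters here.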

In my environment, 'health check timed out' reproduces when a Spark task runs for more than 30 minutes,
but the same job works well on YARN or Spark standalone.

> Slave asked to shut down by master because 'health check timed out'
> -------------------------------------------------------------------
>
>                 Key: MESOS-2679
>                 URL: https://issues.apache.org/jira/browse/MESOS-2679
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>    Affects Versions: 0.22.1
>            Reporter: Littlestar
>
> I run Spark 1.3.1 on Mesos 0.22.1 rc6 (linux64), and some Mesos slave nodes go offline.
> slave node logs:
> I0430 15:12:12.737057 32354 slave.cpp:571] Slave asked to shut down by master@192.168.1.10:5050 because 'health check timed out'
> master node logs:
> I0430 15:12:00.615777 19759 master.cpp:237] Shutting down slave 20150430-141442-1214949568-5050-19747-S2 due to health check timeout
> W0430 15:12:00.616083 19751 master.cpp:3417] Shutting down slave 20150430-141442-1214949568-5050-19747-S2 at slave(1)@192.168.1.15:5051 (hpblade05) with message 'health check timed out'
> Why does the master take the slave offline, and why does the slave not restart itself?
> Is there any configuration to increase this timeout interval?


