Posted to dev@storm.apache.org by "Basti Liu (JIRA)" <ji...@apache.org> on 2015/11/04 03:16:27 UTC
[jira] [Comment Edited] (STORM-1155) Supervisor recurring health checks

    [ https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988744#comment-14988744 ] 

Basti Liu edited comment on STORM-1155 at 11/4/15 2:16 AM:
-----------------------------------------------------------

Hi Thomas,

Thanks for raising this issue.
I am interested in supervisor health checks, which are important for avoiding unnecessary attempts to start new workers on a problematic node.
But for workers that are already running, it might not be a good choice to kill them when the supervisor's health check fails. It is very difficult for an external script to check whether the running workers are still in a correct state. So I think it is better to keep relying only on the heartbeat mechanism to judge whether running workers need to be killed and reassigned.
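
The policy suggested above could be sketched roughly as follows. This is only an illustration in Python, not Storm code: the function name, its parameters, and the idea of passing heartbeat ages as a dictionary are all assumptions made for the example.

```python
def supervisor_actions(node_healthy, heartbeat_age_secs, timeout_secs):
    """Hypothetical helper: decide what the supervisor should do.

    node_healthy       -- result of the node health check (bool)
    heartbeat_age_secs -- worker id -> seconds since its last heartbeat
    timeout_secs       -- heartbeat timeout after which a worker counts as dead

    Returns (may_launch_new_workers, worker_ids_to_kill).
    """
    # A failed health check only blocks launching new workers on this node.
    may_launch_new_workers = node_healthy

    # Running workers are judged by heartbeat alone, regardless of node health.
    worker_ids_to_kill = sorted(
        worker_id
        for worker_id, age in heartbeat_age_secs.items()
        if age > timeout_secs
    )
    return may_launch_new_workers, worker_ids_to_kill
```

With an unhealthy node and one stale worker, for example, `supervisor_actions(False, {"w1": 2.0, "w2": 45.0}, 30.0)` blocks new launches and selects only `w2` for killing.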


> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin. If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to execute properly so you don't want to mark the node as unhealthy.
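
The result rules in the description above can be sketched as below. This is a minimal Python illustration under stated assumptions, not the proposed Storm implementation: the names `Health`, `classify`, and `run_script` are hypothetical, "ERROR on stdout" is assumed to mean an output line starting with `ERROR`, a non-zero exit code is assumed to take precedence over any output, and a timeout is assumed to count as a script failure.

```python
import subprocess
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"        # script reported ERROR on stdout
    CHECK_FAILED = "check_failed"  # script itself did not run properly


def classify(exit_code, stdout):
    """Classify a single script run from its exit code and stdout."""
    # Non-zero exit: the script failed to execute properly, so the node
    # is not marked unhealthy (assumption: this wins over any output).
    if exit_code != 0:
        return Health.CHECK_FAILED
    # Assumption: an output line starting with ERROR signals a node problem.
    if any(line.startswith("ERROR") for line in stdout.splitlines()):
        return Health.UNHEALTHY
    return Health.HEALTHY


def run_script(path, timeout_secs=30.0):
    """Run one health check script and classify its result."""
    try:
        result = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout_secs
        )
    except (OSError, subprocess.TimeoutExpired):
        # Could not start the script, or it hung: treat as a failed check
        # rather than as an unhealthy node (assumption).
        return Health.CHECK_FAILED
    return classify(result.returncode, result.stdout)
```

A supervisor loop would call `run_script` for each file in the admin-provided directory and react only to `Health.UNHEALTHY`.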



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)