Posted to common-dev@hadoop.apache.org by "Hemanth Yamijala (JIRA)" <ji...@apache.org> on 2009/05/12 12:59:45 UTC

[jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status

    [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708395#action_12708395 ] 

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

We might start working on this soon. I thought it might be a good idea to share current thinking about the specs for this feature, and start a discussion.

In a brief discussion with the team, Eric and Owen, we came up with the following points:

- Let the administrator configure the path to a script file that will be run periodically on the tasktracker. The interval between runs can also be configured.
- The tasktracker would run the script in a separate thread and check its exit code. If the script exits with a non-zero code, this will be reported to the jobtracker.
- Any output from the script (upon error) will be logged and if possible, also displayed on the web UI of the tasktracker.
- The jobtracker will blacklist this node when such a condition is reported.
- It will be a good idea to display the 'unhealthy' nodes on the UI.
- The tasktracker will need to continue to run the script so that, if the condition is corrected, the node can be reported to the jobtracker as healthy and become available again.
- A point came up about whether this mechanism can lead to the state of the node toggling. Maybe we can add some hysteresis, but as a follow-up to this jira.
- However, it may also be a good idea to show some stats on the tasktracker page of the web UI, like how often this node was blacklisted in the last 24 hours and its current status. This might help us decide if it's worthwhile introducing hysteresis.
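
To make the tasktracker-side points above concrete, here is a rough sketch of the idea. It is in Python for brevity and is not the actual patch or any existing Hadoop API. The names NodeHealthChecker, check_once, blacklist_count and run_periodically are hypothetical, and the jobtracker report is assumed to be a plain callback.

```python
import subprocess
import sys
import threading
import time
from collections import deque


class NodeHealthChecker:
    """Hypothetical sketch: runs an admin-supplied script and tracks node health."""

    def __init__(self, script_cmd, window_secs=24 * 3600, clock=time.time):
        self.script_cmd = script_cmd      # command (list of args) for the health script
        self.window_secs = window_secs    # stats window, 24 hours by default
        self.clock = clock
        self.healthy = True
        self.last_output = ""             # script output kept only on failure
        self.unhealthy_events = deque()   # times of healthy -> unhealthy transitions

    def check_once(self):
        """Run the script once; a non-zero exit code means 'unhealthy'."""
        result = subprocess.run(self.script_cmd, capture_output=True, text=True)
        now_healthy = result.returncode == 0
        if now_healthy:
            self.last_output = ""
        else:
            self.last_output = (result.stdout + result.stderr).strip()
        if self.healthy and not now_healthy:
            self.unhealthy_events.append(self.clock())
        self.healthy = now_healthy
        return now_healthy

    def blacklist_count(self):
        """Number of times the node turned unhealthy within the stats window."""
        cutoff = self.clock() - self.window_secs
        while self.unhealthy_events and self.unhealthy_events[0] < cutoff:
            self.unhealthy_events.popleft()
        return len(self.unhealthy_events)


def run_periodically(checker, interval_secs, report, stop_event):
    """Keep checking until stopped, so a corrected condition is reported too.

    Intended to run in its own thread; 'report' stands in for whatever
    mechanism carries the status to the jobtracker.
    """
    while not stop_event.is_set():
        report(checker.check_once(), checker.last_output)
        stop_event.wait(interval_secs)


# Stand-in for a health script that always fails (assumption for the demo).
failing = [sys.executable, "-c", "import sys; print('disk full'); sys.exit(1)"]
checker = NodeHealthChecker(failing)
checker.check_once()
print(checker.healthy, checker.last_output, checker.blacklist_count())
```

In a real daemon, run_periodically would be started with threading.Thread(target=run_periodically, args=(checker, interval, report, stop_event)). The count returned by blacklist_count is the kind of stat that could be shown on the web UI to judge whether hysteresis is needed.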

Does this sound good? Any other thoughts?


> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run some scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.