Posted to common-dev@hadoop.apache.org by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org> on 2009/05/29 05:48:45 UTC

[jira] Updated: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status

     [ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan updated HADOOP-5478:
-------------------------------------------

    Attachment: hadoop-5478-1.patch

Attaching a first-cut patch to address the issue.

The patch does the following:

* The patch requires two configuration items to be present on TaskTracker nodes: {{mapred.tasktracker.health_check_script}} and {{mapred.tasktracker.health_check_interval}}. {{mapred.tasktracker.health_check_script}} must be the absolute path to the script file. If the file does not exist when the TT starts up, the monitor is turned off.
* The monitor periodically runs the shell script. It ignores the script's exit code, captures the script's output, and searches that output for the pattern "ERROR".
* If ERROR is present in the output, the monitor marks the node as unhealthy and reports the entire output to the JT as the node's status.
* Depending on the reported health of the node, the JT then decides whether to blacklist or whitelist it.
* A test case is attached that tests blacklisting and whitelisting according to the script's output.
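
For readers who have not opened the patch, the monitor behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in hadoop-5478-1.patch: the class name {{NodeHealthSketch}} and the methods {{shouldMonitor}}, {{runScript}} and {{isUnhealthy}} are hypothetical, and merging stderr into stdout is an assumption made here to keep the example simple.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the health-check logic; names are not from the patch.
final class NodeHealthSketch {
    // Pattern the monitor searches for in the script output.
    static final String ERROR_PATTERN = "ERROR";

    // The monitor is only enabled if the configured script path is
    // absolute and the file exists at startup.
    static boolean shouldMonitor(String scriptPath) {
        if (scriptPath == null || scriptPath.isEmpty()) {
            return false;
        }
        File script = new File(scriptPath);
        return script.isAbsolute() && script.exists();
    }

    // Runs the script once and returns everything it printed.
    // The exit code is deliberately ignored, as described above.
    static String runScript(String scriptPath)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(scriptPath);
        // Assumption: stderr is merged into stdout so nothing blocks on a full pipe.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor(); // wait for completion; the exit code is not inspected
        return output.toString();
    }

    // A node is treated as unhealthy if the output contains "ERROR".
    static boolean isUnhealthy(String scriptOutput) {
        return scriptOutput.contains(ERROR_PATTERN);
    }
}
```

In a sketch like this, a timer would call {{runScript}} once per configured interval, and the full output would be sent to the JT as the node's status whenever {{isUnhealthy}} returns true.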

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run some scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.