Posted to common-dev@hadoop.apache.org by "eric baldeschwieler (JIRA)" <ji...@apache.org> on 2006/11/16 20:35:38 UTC

[jira] Commented: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster

    [ http://issues.apache.org/jira/browse/HADOOP-442?page=comments#action_12450506 ] 
            
eric baldeschwieler commented on HADOOP-442:
--------------------------------------------

Current proposal:

- Add a config variable that points to a file containing the list of nodes HDFS should expect (the slaves file) (optional config)
- Add a config variable that points to a file containing a list of nodes excluded from the previous list (optional config)
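
A minimal sketch of reading these two optional settings, assuming hypothetical
property names dfs.hosts and dfs.hosts.exclude (this proposal does not fix the
names):

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical property names; both settings are optional, so an empty
    // default means "not configured".
    Configuration conf = new Configuration();
    String hostsFile   = conf.get("dfs.hosts", "");          // include list
    String excludeFile = conf.get("dfs.hosts.exclude", "");  // exclude list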

- The NameNode reads these files on startup (if configured).  It keeps a list of included nodes and another of excluded nodes.  If the include list is configured, it will be checked when a node registers or sends a heartbeat; a node not on the list will be told to shut down in the response.  If the exclude list is configured, a node will likewise be told to shut down if it is listed.
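
A rough sketch of that decision, with illustrative names (isAllowed is not an
actual Hadoop method):

    import java.util.Set;

    // Hypothetical NameNode-side check, run on register and on heartbeat.
    static boolean isAllowed(String host, Set<String> included, Set<String> excluded) {
        // An empty include list means "accept any node that is not excluded".
        if (!included.isEmpty() && !included.contains(host)) {
            return false;  // not on the include list: shut down in the response
        }
        return !excluded.contains(host);  // on the exclude list: shut down too
    }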

- We will add an admin command to re-read the inclusion and exclusion files.
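
For illustration only, that could surface as a dfsadmin subcommand along these
lines (the -refreshNodes name is an assumption, not something this proposal
decides):

    $ bin/hadoop dfsadmin -refreshNodes   # hypothetical: re-read both files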

- The JobTracker will also read these lists and will have a new admin command to re-read the files.
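
A sketch of the re-read step both daemons would need, with assumed names and an
assumed one-hostname-per-line file format:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative helper: read one hostname per line, skipping blanks.
    // The admin command would call this for both files and swap in the
    // resulting sets.
    static Set<String> readHostsFile(String path) throws IOException {
        Set<String> hosts = new HashSet<String>();
        if (path.length() == 0) {
            return hosts;  // list not configured: empty set
        }
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    hosts.add(line);
                }
            }
        } finally {
            in.close();
        }
        return hosts;
    }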



> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: http://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>
> I recently had a few nodes go bad, such that they were inaccessible via ssh but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes because of the ssh issue, so I was helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when the bad nodes are removed from the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and over, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would list nodes that shouldn't be accessed and that should be ignored if they try to connect. That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section.
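
As a sketch of the single-file format the reporter describes (the 'exclude:'
marker is hypothetical; the proposal above uses two separate files instead):

    node001
    node002
    node003
    exclude:
    node002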

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira