Posted to common-dev@hadoop.apache.org by "Wendy Chien (JIRA)" <ji...@apache.org> on 2007/01/26 23:44:49 UTC

[jira] Commented: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster

    [ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467944 ] 

Wendy Chien commented on HADOOP-442:
------------------------------------

Here is the latest design (largely the same as the previous proposal, with a few more details):

1. Adding two new config variables.  By default they will not be configured (commented out in hadoop-default).
  a. hadoop.nodes.include -- a file listing the nodes to include.  If this variable is configured, only nodes on this list will be allowed to register with the namenode.
  b. hadoop.nodes.exclude -- a file listing the nodes to exclude.  If this variable is configured, any node on this list will be denied communication with the namenode (registering, sending heartbeats, sending block reports, reporting received blocks, and sending error reports), even if it also appears in the include file.

If neither is configured, any node is allowed to connect.  We currently have a slaves file that is used by slaves.sh; that file can serve as the one specified by hadoop.nodes.include, but there is no requirement that it does.  Example config entries are sketched below.
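For concreteness, this is roughly what the two variables might look like in a site config.  It is only a sketch: the property names follow the proposal above, and the file paths are made-up examples.

    <property>
      <name>hadoop.nodes.include</name>
      <value>/home/hadoop/conf/hosts.include</value>
      <description>If set, only hosts listed in this file (one per
      line) may register with the namenode.</description>
    </property>
    <property>
      <name>hadoop.nodes.exclude</name>
      <value>/home/hadoop/conf/hosts.exclude</value>
      <description>If set, hosts listed in this file are denied, even
      if they also appear in the include file.</description>
    </property>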

2. Adding a dfsadmin command (refreshNodes) to reread the inclusion/exclusion files.   
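Assuming refreshNodes is wired in as an ordinary dfsadmin option, an admin would edit the exclude file and then run something like:

    bin/hadoop dfsadmin -refreshNodes

after which the namenode rereads both files without a restart.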

3. The files will be read when the NameNode starts up and whenever it receives a refreshNodes command.

4. JobTracker will use the same config variables to determine which TaskTrackers to include/exclude.  A sketch of the shared check follows.
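
To make the semantics of 1a/1b and 3 concrete, here is a minimal sketch in Java of how the check might work.  The class and method names are hypothetical (not actual Hadoop code); it assumes each file holds one hostname per line and is reread at startup and on each refreshNodes.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical helper illustrating the include/exclude rules above.
    public class HostsFilter {
      private Set<String> includes = new HashSet<String>();
      private Set<String> excludes = new HashSet<String>();

      // Reread both files; called at startup and on each refreshNodes.
      // A null path means the corresponding variable is not configured.
      public synchronized void refresh(String includeFile, String excludeFile)
          throws IOException {
        includes = readHosts(includeFile);
        excludes = readHosts(excludeFile);
      }

      // Exclusion always wins, even for hosts also named in the include
      // file; an unconfigured (empty) include list admits everyone.
      public synchronized boolean isAllowed(String host) {
        if (excludes.contains(host)) {
          return false;
        }
        return includes.isEmpty() || includes.contains(host);
      }

      private static Set<String> readHosts(String file) throws IOException {
        Set<String> hosts = new HashSet<String>();
        if (file == null) {
          return hosts; // variable not configured
        }
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() > 0) {
              hosts.add(line); // one hostname per line
            }
          }
        } finally {
          in.close();
        }
        return hosts;
      }
    }

The same object could back both the namenode checks in 1b and the JobTracker's TaskTracker checks in 4.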




> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>
> I recently had a few nodes go bad, such that they were inaccessible via ssh but were still running their java processes.
> Tasks that executed on them kept failing, causing jobs to fail.
> I couldn't stop the java processes because of the ssh issue, so I was helpless until I could actually power down those nodes.
> Restarting the cluster doesn't help, even when the bad nodes are removed from the slaves file - they just reconnect and are accepted.
> While we plan to keep tasks from repeatedly launching on the same nodes, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would list nodes that shouldn't be accessed and should be ignored if they try to connect. That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.