Posted to common-issues@hadoop.apache.org by "Bryan Duxbury (JIRA)" <ji...@apache.org> on 2011/01/13 19:05:45 UTC

[jira] Created: (HADOOP-7103) When rack awareness script returns nothing, cluster stops working

When rack awareness script returns nothing, cluster stops working
-----------------------------------------------------------------

                 Key: HADOOP-7103
                 URL: https://issues.apache.org/jira/browse/HADOOP-7103
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Bryan Duxbury


This was an interesting one. Our rack awareness script contains a one-to-one mapping from host/IP to rack. We added a new rack's worth of machines without updating the awareness script, and when the script was called for the new machines, it returned no output at all.

This had the surprising effect that basically the entire cluster stopped working. Even tasks or blocks assigned to nodes with a valid rack seemed to fail. The errors were detectable only in the namenode and jobtracker logs, so it took a while before we could figure out the problem. After we fixed the rack awareness script, everything returned to normal operation.

It seems to me that either the error should be raised more aggressively, or a "default" rack should be assumed. This would keep simple mistakes from making the entire cluster unusable.
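
Until something like that exists in Hadoop itself, the same idea can be approximated on the script side. The following is only a minimal sketch, not our actual script: it assumes the script is invoked with one or more host names or IPs as arguments and is expected to print one rack per argument on stdout. The mapping table, the resolve_racks helper, and the choice of "/default-rack" as the fallback name are all hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a rack awareness script that never returns an empty answer."""
import sys

# Hypothetical host/IP -> rack table; a real script would load its own mapping.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}

# Fallback rack for hosts missing from the table (assumed name).
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return exactly one rack per host, in order, without skipping unknown hosts."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    hosts = sys.argv[1:]
    unknown = [host for host in hosts if host not in RACK_BY_HOST]
    if unknown:
        # Complain on stderr so stdout still carries only rack names.
        print("no rack mapping for: " + " ".join(unknown), file=sys.stderr)
    print(" ".join(resolve_racks(hosts)))
```

With this shape, a forgotten host lands in the fallback rack and produces a warning, rather than producing no output.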

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-7103) When rack awareness script returns nothing, cluster stops working

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated HADOOP-7103:
----------------------------------

    Affects Version/s: 0.20.2

