Posted to common-dev@hadoop.apache.org by "Marco Nicosia (JIRA)" <ji...@apache.org> on 2008/04/05 01:32:24 UTC

[jira] Created: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

HOD gracefully exclude "bad" nodes during ring formation
--------------------------------------------------------

                 Key: HADOOP-3184
                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/hod
            Reporter: Marco Nicosia


HOD clusters sometimes fail to allocate due to a single "bad" node. During ring formation, the entire ring should not depend on every single node being good. Instead, it should exclude any ring member that does not adequately join the ring within a specified amount of time.

This is a frequent HOD user issue (although not directly caused by HOD).

Examples of bad nodes: Missing java, incorrect version of HOD or Hadoop, local name-cache corrupt, slow network links, drives just beginning to fail, etc.

Many of these conditions are known, and we can monitor for those separately, but this enhancement would shield users from unknown failure conditions that we haven't yet anticipated. This way, a user will get a cluster, instead of hanging indefinitely.
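
The requested behaviour can be illustrated with a minimal sketch. This is not HOD's code: form_ring, poll_joined and the timeout parameters are hypothetical names, and the sketch is written for a modern Python 3. It waits for the expected nodes until a deadline and then proceeds with whoever joined, returning the stragglers separately so that they can be reported.

```python
import time


def form_ring(expected_nodes, poll_joined, timeout_s=60.0, poll_interval_s=1.0,
              clock=time.monotonic, sleep=time.sleep):
    """Wait up to timeout_s for nodes to join the ring.

    poll_joined is a hypothetical callable returning the names of the nodes
    that have joined so far. Returns (members, excluded), both sorted.
    """
    expected = set(expected_nodes)
    joined = set()
    deadline = clock() + timeout_s
    while True:
        # Ignore anything that reports in but was never part of the request.
        joined |= set(poll_joined()) & expected
        if joined == expected or clock() >= deadline:
            break
        sleep(poll_interval_s)
    # Nodes that never joined are excluded instead of blocking the ring.
    return sorted(joined), sorted(expected - joined)
```

The clock and sleep parameters exist only so that the deadline logic can be exercised without real waiting.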


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599958#action_12599958 ] 

Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

There are 2 possible types of issues users are facing:

- HOD allocations fail (that is, the allocate command returns with a non-zero exit code) due to some of the conditions mentioned above. Retrying doesn't help unless the condition is rectified or the node which has the condition is removed from the resource manager's list. This is particularly true in Torque, as it returns the same set of nodes in the same order, and hence the failure condition is mostly repeated.
OR
- HOD allocation hangs (without returning), again due to some of the conditions mentioned.

First, can you please confirm which of the two is the bigger issue?

AFAIK, the second case is a Torque issue where we do not even get control to do anything. We could attempt to fix the first one - maybe even outside of HOD. Maybe we could offline a node if HOD allocations fail a couple of times on it. So, in an automated manner, the offending node is removed, and further attempts would work. 
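
As a rough illustration of the "offline a node after a couple of failures" idea, here is a small sketch. NodeFailureTracker and its threshold are hypothetical and keep state only in memory; a real deployment would presumably hand the node to the resource manager's own offlining mechanism (for Torque, something along the lines of pbsnodes -o) rather than track it like this.

```python
from collections import Counter


class NodeFailureTracker:
    """Hypothetical helper: count allocation failures per node."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # failures tolerated before offlining
        self.failures = Counter()
        self.offline = set()

    def record_failure(self, node):
        """Record one failed allocation; return True if the node is now offline."""
        self.failures[node] += 1
        if self.failures[node] >= self.threshold:
            self.offline.add(node)
        return node in self.offline

    def usable(self, nodes):
        """Filter a node list down to those not marked offline, keeping order."""
        return [n for n in nodes if n not in self.offline]
```

Because Torque tends to hand back the same nodes in the same order, filtering the offending node out of the candidate list is what lets a retry behave differently from the failed attempt.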


> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>
> HOD clusters sometimes fail to allocate due to a single "bad" node. During ring formation, the entire ring should not depend on every single node being good. Instead, it should exclude any ring member that does not adequately join the ring within a specified amount of time.
> This is a frequent HOD user issue (although not directly caused by HOD).
> Examples of bad nodes: Missing java, incorrect version of HOD or Hadoop, local name-cache corrupt, slow network links, drives just beginning to fail, etc.
> Many of these conditions are known, and we can monitor for those separately, but this enhancement would shield users from unknown failure conditions that we haven't yet anticipated. This way, a user will get a cluster, instead of hanging indefinitely.


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Release Note: Running through Hudson.
    Hadoop Flags: [Reviewed]
          Status: Patch Available  (was: Open)


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3184:
------------------------------------

    Release Note: Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. Introduced new property ringmaster.max-master-failures to specify the maximum number of times a master is allowed to fail.  (was: Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. These retries are done a configured number of times per master. The change is incompatible because a new required configuration option is introduced: ringmaster.max-master-failures, which defines the maximum number of times a master is allowed to fail.)
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
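
For readers unfamiliar with the option, the sketch below shows how a setting named ringmaster.max-master-failures could be read, assuming it lives as max-master-failures in a [ringmaster] section of an INI-style HOD configuration file. The section layout, the value 5, and the helper name are illustrative assumptions, not taken from the patch, and the code uses Python 3's configparser rather than HOD's own option parsing.

```python
import configparser

# Illustrative fragment only; the value is an assumption, not a documented default.
HODRC_FRAGMENT = """
[ringmaster]
max-master-failures = 5
"""


def read_max_master_failures(text):
    """Return the configured cap on master launch failures as an int."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # getint raises a configparser error if the section or option is missing.
    return parser.getint("ringmaster", "max-master-failures")
```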


[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602955#action_12602955 ] 

Hadoop QA commented on HADOOP-3184:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383534/3184.2.patch
  against trunk revision 663841.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/console

This message is automatically generated.


[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599960#action_12599960 ] 

Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

Another approach is the following:

Mostly, HOD allocations fail if the RingMaster does not come up or the JobTracker does not come up. If the JobTracker does not come up, then the hodring on the node can report a failure, and another node which asks for the hadoop command can be asked to run the JT. If the RingMaster does not come up, it's a bit more difficult, because that's what controls the whole process. So, maybe in that case, the RingMaster should somehow bring up another instance of itself on a different machine and then die gracefully.

I think the latter change would be quite involved. The former should be simpler.
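
The simpler of the two changes (retrying a failed master on another node) could look roughly like the sketch below. The names launch_master, try_launch and max_failures are hypothetical, and the actual patch may structure this quite differently; the sketch only shows the control flow of trying candidate nodes in turn and giving up after a bounded number of failures.

```python
def launch_master(candidates, try_launch, max_failures=5):
    """Try to start a master (e.g. the JobTracker) on successive nodes.

    try_launch is a hypothetical callable returning True if the master came
    up on the given node. Returns (node_or_None, failed_nodes).
    """
    failed = []
    for node in candidates:
        if len(failed) >= max_failures:
            break  # too many bad nodes: fail the allocation as a whole
        if try_launch(node):
            return node, failed
        # Remember the bad node so that it can be reported to the user.
        failed.append(node)
    return None, failed
```

Returning the list of failed nodes alongside the result keeps open the option of reporting them even when the allocation eventually succeeds.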


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Attachment: 3184.2.patch

Patch addressing some of Mahadev's comments.


[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602948#action_12602948 ] 

Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

bq. I presume you mean in the case where some machines failed, but the cluster eventually came up, right? Because otherwise, we do print a report on the command line for the users that the hodring on this machine failed due to this reason. The services folks could then check the ringmaster log to see what other machines failed.

In an offline conversation I had with Mahadev, I actually found that he had meant the latter, which is supported. So, all is good. The utility of the other feature remains, though it can be done as an enhancement at a later stage.


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Status: Patch Available  (was: Open)


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Release Note: Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. These retries are done a configured number of times per master. The change is incompatible because a new required configuration option is introduced: ringmaster.max-master-failures, which defines the maximum number of times a master is allowed to fail.
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed])


[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602962#action_12602962 ] 

Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

The core test failure is unrelated to the patch.


[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602670#action_12602670 ] 

Hadoop QA commented on HADOOP-3184:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383452/3184.1.patch
  against trunk revision 663487.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/console

This message is automatically generated.


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Attachment: 3184.1.patch

Patch that addresses issue of JobTracker or NameNode failure


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3184:
--------------------------------

      Resolution: Fixed
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Hemanth!


[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Release Note:   (was: Running through Hudson.)


[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602929#action_12602929 ] 

Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

Mahadev, thank you for the review.

bq. 1) shouldRetryMasterLaunch is defined but is not used anywhere

This is removed now.

bq. 2) you might want to wrap most of the statements since they exceed 80 character columns.

Also done.

bq. 3) is there something we can report back to the user on the command line that some machines are faulty - CRITICAL contact admin? it would be really helpful if we can do that.

I presume you mean the case where some machines failed but the cluster eventually came up, right? In the other case, where allocation fails, we already print a report on the command line telling the user which machine's hodring failed and for what reason. The services folks can then check the ringmaster log to see which other machines failed.

If you meant the former (i.e. the case of eventual success), I agree that it would be a useful feature to have. However, it would take more work to build this functionality into the client. I propose we leave this as is for now and make the enhancement in a later release.

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>         Attachments: 3184.1.patch, 3184.2.patch
>
>

[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Fix Version/s: 0.18.0

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>             Fix For: 0.18.0
>
>

[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602797#action_12602797 ] 

Mahadev konar commented on HADOOP-3184:
---------------------------------------

1) shouldRetryMasterLaunch is defined but is not used anywhere.
2) You might want to wrap most of the statements, since they exceed 80 character columns.
3) Is there something we can report back to the user on the command line to say that some machines are faulty (CRITICAL, contact admin)? It would be really helpful if we could do that.

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>         Attachments: 3184.1.patch
>
>

[jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3184:
-------------------------------------

    Status: Open  (was: Patch Available)

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>         Attachments: 3184.1.patch, 3184.2.patch
>
>

[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602614#action_12602614 ] 

Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

The attached patch solves the problem of cluster allocation failing due to a single bad JobTracker node in the entire cluster. It does not handle ringmaster failures, which is much tougher to solve at this point.

Description of the solution:

This patch builds on the solution of HADOOP-3464, where we introduced an RPC message (setHodRingErrors) that a HodRing calls when it fails to launch the Hadoop daemons on a node (e.g. because Hadoop is missing). In HADOOP-3464, upon receiving this error, we checked whether the error came while launching a master command (i.e. a NameNode or JobTracker command) and, if so, simply propagated it back to the client, which deallocated the cluster after displaying the error message from the hodring.

In this patch, we keep track of how many times such master commands have failed in a variable in the service object. We also introduce a config variable, ringmaster.max-master-failures. The RingMaster returns an error to the client only when the number of master command failures exceeds the configured value. If the number is not exceeded, the next HodRing that asks for a command to launch is given the master command again.

The effective value of ringmaster.max-master-failures is bounded by a function of the maximum number of requested nodes, in case the number of nodes is smaller than the configured value. This ensures that the cluster allocation can still fail when there are no longer enough nodes available to bring up the masters.
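
The retry rule described above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class and method names are hypothetical, and the bound used here (requested nodes minus one) is an assumption, since the comment only says the limit is "a function of" the number of requested nodes.

```python
class MasterLaunchTracker(object):
    """Hypothetical helper illustrating the failure-count rule.

    Names are illustrative only; they are not taken from the HOD sources.
    """

    def __init__(self, max_master_failures, requested_nodes):
        # Assumption: cap the configured limit at (requested_nodes - 1), so
        # that once every requested node has failed the limit is exceeded
        # and the allocation fails instead of waiting for a node that will
        # never ask for the master command.
        node_bound = max(0, requested_nodes - 1)
        self.limit = min(max_master_failures, node_bound)
        self.failures = 0

    def record_failure(self):
        """Record one failed master launch.

        Returns True when the error should be propagated to the client,
        i.e. when the number of failures exceeds the effective limit.
        """
        self.failures += 1
        return self.failures > self.limit

    def should_reissue_master_command(self):
        """True while the next HodRing may be given the master command."""
        return self.failures <= self.limit
```

With a configured limit of 5 and only 3 requested nodes, the effective limit in this sketch is 2: the first two failures cause the master command to be handed to the next HodRing, and the third failure is reported to the client.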



> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>         Attachments: 3184.1.patch
>
>

[jira] Assigned: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala reassigned HADOOP-3184:
----------------------------------------

    Assignee: Hemanth Yamijala

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>