Posted to yarn-issues@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2016/01/11 17:00:41 UTC

[jira] [Created] (YARN-4576) Extend blacklist mechanism to protect AM failed multiple times on failure nodes

Junping Du created YARN-4576:
--------------------------------

             Summary: Extend blacklist mechanism to protect AM failed multiple times on failure nodes
                 Key: YARN-4576
                 URL: https://issues.apache.org/jira/browse/YARN-4576
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: Junping Du
            Assignee: Junping Du
            Priority: Critical


The current YARN blacklist mechanism tracks bad nodes in the AM: if containers that the AM tries to launch on a specific node fail several times, the AM blacklists that node in its future resource requests. This mechanism works fine for normal containers. However, from what we have observed on clusters, if the AM itself fails to launch on a problematic node, the RM can pick the same node again and again for the next AM attempts, which causes the application to fail when the other, functional nodes are busy. In the normal case, a customized health checker script cannot be sensitive enough to mark a node as unhealthy after only one or two container launch failures. On the RM side, however, we can blacklist these nodes for AM launches for a certain period of time.
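
A rough sketch of the RM-side idea, for illustration only. The class name AmLaunchBlacklist, its methods, and the threshold and duration parameters are all hypothetical; none of them are existing YARN classes or configuration keys. The current time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not an existing YARN class): tracks AM launch
 * failures per node and blacklists a node for AM launches for a fixed time.
 */
class AmLaunchBlacklist {
    private final int failureThreshold;   // assumed: failures before blacklisting
    private final long blacklistMillis;   // assumed: how long a node stays blacklisted
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    AmLaunchBlacklist(int failureThreshold, long blacklistMillis) {
        this.failureThreshold = failureThreshold;
        this.blacklistMillis = blacklistMillis;
    }

    /** Record one failed AM launch on the given node at time nowMillis. */
    void recordAmLaunchFailure(String node, long nowMillis) {
        int count = failures.merge(node, 1, Integer::sum);
        if (count >= failureThreshold) {
            // Blacklist the node for AM launches only, and reset its counter.
            blacklistedUntil.put(node, nowMillis + blacklistMillis);
            failures.remove(node);
        }
    }

    /** True if the node should be skipped when placing the next AM attempt. */
    boolean isBlacklisted(String node, long nowMillis) {
        Long until = blacklistedUntil.get(node);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            // The blacklist period has expired; the node is eligible again.
            blacklistedUntil.remove(node);
            return false;
        }
        return true;
    }
}
```

In this sketch the blacklist expires on its own, so a node that only had a transient problem becomes eligible for AM launches again without operator action, and normal container placement is not affected.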



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)