Posted to yarn-issues@hadoop.apache.org by "Hong Zhiguo (JIRA)" <ji...@apache.org> on 2015/09/18 10:00:10 UTC
[jira] [Updated] (YARN-4181) node blacklist for AM launching

     [ https://issues.apache.org/jira/browse/YARN-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong Zhiguo updated YARN-4181:
------------------------------
    Description: 
In some cases a node becomes faulty and most containers launched on it fail, including AM containers.
Because its containers keep failing, this node has more available resources than other nodes in the cluster. The Application whose AM keeps failing has a minShareRatio of zero. With the fair scheduler, this node is always ranked first, and the unlucky Application is likely ranked first as well. As a result, attempts of this application fail again and again on the same node.

We should avoid such a deadlock situation.

Solution 1: The NM could track the failure rate of its containers. If the rate is high, the NM marks itself unhealthy for a period. But we should be careful that a single buggy Application cannot turn all nodes unhealthy. Maybe track the container failure rate separately for different Applications.
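
A minimal sketch of how Solution 1 might look, written in Python for brevity rather than as NodeManager code. The class name, method names, and all thresholds are hypothetical and are not part of any YARN API. The assumption illustrated here is that the node only marks itself unhealthy when several different Applications fail on it, so that one buggy Application cannot take nodes out of service.

```python
from collections import defaultdict, deque


class ContainerFailureTracker:
    """Hypothetical per-node tracker; not an existing YARN class."""

    def __init__(self, window_secs=600.0, min_launches=3,
                 failure_ratio=0.8, min_failing_apps=2,
                 unhealthy_secs=300.0):
        # All defaults are illustrative assumptions, not tuned values.
        self.window_secs = window_secs
        self.min_launches = min_launches
        self.failure_ratio = failure_ratio
        self.min_failing_apps = min_failing_apps
        self.unhealthy_secs = unhealthy_secs
        self._events = deque()  # (timestamp, app_id, failed)
        self._unhealthy_until = 0.0

    def record_launch(self, now, app_id, failed):
        """Record one container launch result observed at time `now`."""
        self._events.append((now, app_id, failed))
        self._evict(now)
        if self._count_failing_apps() >= self.min_failing_apps:
            self._unhealthy_until = now + self.unhealthy_secs

    def is_unhealthy(self, now):
        return now < self._unhealthy_until

    def _evict(self, now):
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= now - self.window_secs:
            self._events.popleft()

    def _count_failing_apps(self):
        # Count Applications whose own failure ratio is high, so that
        # many failures from one Application count only once.
        launches = defaultdict(int)
        failures = defaultdict(int)
        for _, app_id, failed in self._events:
            launches[app_id] += 1
            if failed:
                failures[app_id] += 1
        return sum(
            1 for app_id, total in launches.items()
            if total >= self.min_launches
            and failures[app_id] / total >= self.failure_ratio
        )
```

With these assumed defaults, ten failures from a single Application leave the node healthy, while three failures each from two different Applications mark it unhealthy for unhealthy_secs.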

Solution 2: Maintain an Application-level blacklist in the AMLauncher, in addition to the existing blacklist maintained by the AM.
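
A minimal sketch of the idea behind Solution 2, again in Python and under stated assumptions. AmLaunchBlacklist, its methods, and the max_blacklisted_fraction guard are hypothetical names introduced for illustration and do not describe the actual AMLauncher. The sketch remembers, per Application, the nodes on which an AM attempt failed and skips them when choosing a node for the next attempt.

```python
class AmLaunchBlacklist:
    """Hypothetical per-Application blacklist for AM launches."""

    def __init__(self, max_blacklisted_fraction=0.5):
        # Assumed safety valve: if more than this fraction of the
        # candidate nodes is blacklisted, the blacklist is ignored.
        self.max_blacklisted_fraction = max_blacklisted_fraction
        self._failed_nodes = {}  # app_id -> set of node ids

    def record_am_failure(self, app_id, node_id):
        """Remember that an AM attempt of app_id failed on node_id."""
        self._failed_nodes.setdefault(app_id, set()).add(node_id)

    def candidate_nodes(self, app_id, nodes_by_preference):
        """Return the scheduler's node ordering minus blacklisted nodes."""
        blacklisted = self._failed_nodes.get(app_id, set())
        hit = blacklisted & set(nodes_by_preference)
        limit = self.max_blacklisted_fraction * len(nodes_by_preference)
        if len(hit) > limit:
            # Too many nodes excluded; fall back to the full list.
            return list(nodes_by_preference)
        return [n for n in nodes_by_preference if n not in blacklisted]
```

Because the blacklist is keyed by Application, a node that repeatedly fails one Application's AM stays available to other Applications, which is the difference from marking the whole node unhealthy.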

  was:
In some cases a node becomes faulty and most containers launched on it fail, including AM containers.
Because its containers keep failing, this node has more available resources than other nodes in the cluster. The Application whose AM keeps failing has a minShareRatio of zero. With the fair scheduler, this node is always ranked first, and the unlucky Application is likely ranked first as well. As a result, attempts of this application fail again and again on the same node.

Solution 1: The NM could track the failure rate of its containers. If the rate is high, the NM marks itself unhealthy for a period. But we should be careful that a single buggy Application cannot turn all nodes unhealthy. Maybe track the container failure rate separately for different Applications.

Solution 2: Maintain an Application-level blacklist in the AMLauncher, in addition to the existing blacklist maintained by the AM.


> node blacklist for AM launching
> -------------------------------
>
>                 Key: YARN-4181
>                 URL: https://issues.apache.org/jira/browse/YARN-4181
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
>            Priority: Minor
>
> In some cases a node becomes faulty and most containers launched on it fail, including AM containers.
> Because its containers keep failing, this node has more available resources than other nodes in the cluster. The Application whose AM keeps failing has a minShareRatio of zero. With the fair scheduler, this node is always ranked first, and the unlucky Application is likely ranked first as well. As a result, attempts of this application fail again and again on the same node.
> We should avoid such a deadlock situation.
> Solution 1: The NM could track the failure rate of its containers. If the rate is high, the NM marks itself unhealthy for a period. But we should be careful that a single buggy Application cannot turn all nodes unhealthy. Maybe track the container failure rate separately for different Applications.
> Solution 2: Maintain an Application-level blacklist in the AMLauncher, in addition to the existing blacklist maintained by the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)