Posted to yarn-issues@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2016/01/13 19:22:39 UTC
[jira] [Comment Edited] (YARN-4576) Pluggable blacklist/whitelist policies in launching AM

    [ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096727#comment-15096727 ] 

Junping Du edited comment on YARN-4576 at 1/13/16 6:22 PM:
-----------------------------------------------------------

bq. I think some strict return codes can help here. I haven't gone deeper in analyzing this part; however, I feel we can have a global blacklisting if it's not an app-specific launch/container failure.
That's a good point. Some failures, such as DISKS_FAILED, should belong to the global failure type, while others, such as KILLED_EXCEEDED_PMEM, are AM-specific. We need to treat each failure type separately.
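
A minimal sketch of that distinction, under stated assumptions: the enum, the class, and the method names below are hypothetical and are not YARN APIs. Only the two failure names come from this discussion, and treating every other reason as AM-specific is a placeholder default.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical helper: decides whether an AM container failure points at the node or at the app. */
final class AmFailureClassifier {

    /** Illustrative subset of failure reasons; only these two are named in the discussion. */
    enum ExitReason { DISKS_FAILED, KILLED_EXCEEDED_PMEM }

    /** GLOBAL failures would feed a cluster-wide blacklist; AM_SPECIFIC ones stay with the app. */
    enum Scope { GLOBAL, AM_SPECIFIC }

    // Assumption: a disk failure says something about the node, not about the application.
    private static final Set<ExitReason> GLOBAL_REASONS = EnumSet.of(ExitReason.DISKS_FAILED);

    static Scope classify(ExitReason reason) {
        return GLOBAL_REASONS.contains(reason) ? Scope.GLOBAL : Scope.AM_SPECIFIC;
    }
}
```

With this shape, supporting a new failure type only means deciding which set it belongs to.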

bq. But this control will be with applications then. I am not sure how much RM can override this functionality, so some clear definitions can be defined for this.
I am not sure this is a reasonable concern. An AM can already ask for resources on specific nodes for its particular tasks, so why shouldn't it be able to control where it is itself scheduled? For the RM, the assumption here is simply that the AM knows what it is doing when it sets this whitelist. In the implementation, maybe we can have a configuration option to indicate whether the whitelist is only a wish list or a mandatory one.
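
One way the "wish list vs. mandatory" option could behave is sketched below. Everything here is an assumption made for illustration: the `Mode` enum, the class, and the method are hypothetical, and no such configuration exists in YARN as far as this thread states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of applying an AM whitelist in either a preferred or a strict mode. */
final class AmWhitelist {

    /** PREFERRED: whitelist is a wish list. STRICT: AM may only run on whitelisted nodes. */
    enum Mode { PREFERRED, STRICT }

    /** Returns the nodes the RM may consider for the AM container, in the given order. */
    static List<String> candidates(List<String> available, Set<String> whitelist, Mode mode) {
        List<String> preferred = new ArrayList<>();
        for (String node : available) {
            if (whitelist.contains(node)) {
                preferred.add(node);
            }
        }
        // In PREFERRED mode, fall back to every available node when no whitelisted node is free.
        if (preferred.isEmpty() && mode == Mode.PREFERRED) {
            return new ArrayList<>(available);
        }
        // In STRICT mode an empty result means the AM has to wait.
        return preferred;
    }
}
```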


> Pluggable blacklist/whitelist policies in launching AM
> ------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> Before YARN-2005, the YARN blacklist mechanism tracked bad nodes in the AM: if the AM's attempts to launch containers on a specific node failed several times, the AM would blacklist that node in future resource requests. This mechanism works fine for normal containers. However, from our observations on several clusters, if an AM launch fails on such a problematic node, the RM can pick the same node for the next AM attempts again and again, which causes the application to fail when the other, functional nodes are busy. In the normal case, the customized health checker script cannot be so sensitive as to mark a node unhealthy when only one or two container launches fail.
> After YARN-2005, we can have a BlacklistManager in each RMapp, so nodes on which AM attempts previously failed for a specific application get blacklisted. To avoid the risk of all nodes being blacklisted by the BlacklistManager, a disable-failure-threshold is introduced to stop adding nodes to the blacklist once a certain ratio has been reached.
> There are already some enhancements to this AM blacklist mechanism: YARN-4284 addresses the wider case of AM container launch failures, and YARN-4389 tries to make the configuration settings changeable by the app to meet app-specific requirements. However, there are still several gaps, and more scenarios to address:
> 1. We may need a global blacklist instead of each app maintaining a separate one. The reason is that an AM is more likely to fail on a node where other AMs have failed before. A quick example: in a busy cluster, all nodes are busy except two problematic nodes, node a and node b. app1 has already been submitted and has failed two AM attempts, on a and b. app2 and other apps should wait for the other, busy nodes rather than waste attempts on these two problematic nodes.
> 2. If an AM container failure is recognized as a global event rather than the app's own issue, the blacklisting should not be permanent but should apply for a specific time window.
> 3. We could have user-defined blacklist policies to address more possible cases and scenarios, so it is reasonable to make the blacklist policy pluggable.
> 4. For some test scenarios, we could have a whitelist mechanism for AM launching.
> 5. Some minor issues: it appears that NM reconnect does not refresh the blacklist so far.
> Will try to address all issues here.
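
Items 1 to 3 above (a global blacklist, a time window, a pluggable policy) together with the disable-failure-threshold can be sketched roughly as follows. This is only an illustration under stated assumptions: `AmBlacklistPolicy` and `TimeWindowGlobalBlacklist` are hypothetical names and not existing YARN interfaces, timestamps are passed in explicitly to keep the example deterministic, and the threshold is interpreted as "stop adding nodes once the listed share of the cluster would exceed the ratio".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical plug-in point for AM blacklist policies. */
interface AmBlacklistPolicy {
    void recordAmFailure(String node, long nowMillis);
    Set<String> blacklistedNodes(long nowMillis);
}

/** Cluster-wide blacklist whose entries expire after a fixed time window. */
final class TimeWindowGlobalBlacklist implements AmBlacklistPolicy {
    private final Map<String, Long> lastFailure = new HashMap<>();
    private final long windowMillis;
    private final int clusterSize;
    private final double disableThreshold; // max blacklisted share of the cluster, e.g. 0.5

    TimeWindowGlobalBlacklist(long windowMillis, int clusterSize, double disableThreshold) {
        this.windowMillis = windowMillis;
        this.clusterSize = clusterSize;
        this.disableThreshold = disableThreshold;
    }

    @Override
    public synchronized void recordAmFailure(String node, long nowMillis) {
        expire(nowMillis);
        boolean alreadyListed = lastFailure.containsKey(node);
        // Stop adding new nodes once the ratio would be exceeded, so the
        // whole cluster can never end up blacklisted.
        if (!alreadyListed && lastFailure.size() + 1 > disableThreshold * clusterSize) {
            return;
        }
        lastFailure.put(node, nowMillis);
    }

    @Override
    public synchronized Set<String> blacklistedNodes(long nowMillis) {
        expire(nowMillis);
        return new HashSet<>(lastFailure.keySet());
    }

    // Entries are not permanent: drop those whose last failure is a full window old.
    private void expire(long nowMillis) {
        lastFailure.values().removeIf(t -> nowMillis - t >= windowMillis);
    }
}
```

Because the state lives in one shared object rather than per application, app2 in the example above would see nodes a and b as blacklisted as soon as app1 has failed on them.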


