Posted to yarn-issues@hadoop.apache.org by "Craig Welch (JIRA)" <ji...@apache.org> on 2014/07/16 03:22:05 UTC

[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected

    [ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062965#comment-14062965 ] 

Craig Welch commented on YARN-1198:
-----------------------------------

The common thread in this group of JIRAs seems to be that a resource-constrained cluster, or one where a small number of large jobs hold most of the resources, can get into deadlock scenarios.  In addition to fixing the specific behaviors, I think it would be worthwhile to take the min of the calculated headroom and the "cluster headroom" as a sanity check, where cluster headroom is the total cluster resource minus the utilized resources.  This will not help with the application blacklist case (YARN-1680), but it would help with YARN-1857 and YARN-2008: it doesn't correct the mistake in the headroom calculation, but it should prevent that mistake from causing a deadlock.  That's not to say we should not also fix the individual issues, just that this might be a good catch-all for the general problem and for cases we aren't aware of yet.  I'm attaching an initial pass at this as a partial patch; it's just the basics to see whether the direction makes sense, not a finished product.
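
To make the proposed sanity check concrete, here is a minimal sketch of it. This is not the attached patch: `Res`, `HeadroomSanityCheck`, and their methods are hypothetical names used only for illustration, `Res` is a stand-in rather than the real YARN `Resource` record, and taking the min component-wise (memory and vcores separately) is an assumption about how "min" would be applied.

```java
// Hypothetical stand-in for a resource vector; NOT the real YARN Resource class.
final class Res {
    final long memoryMb;
    final int vcores;

    Res(long memoryMb, int vcores) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }

    // Component-wise subtraction, floored at zero so headroom is never negative.
    static Res subtractNonNegative(Res a, Res b) {
        return new Res(Math.max(0L, a.memoryMb - b.memoryMb),
                       Math.max(0, a.vcores - b.vcores));
    }

    // Component-wise minimum (an assumption; a real scheduler might compare
    // resources with a pluggable calculator instead).
    static Res componentwiseMin(Res a, Res b) {
        return new Res(Math.min(a.memoryMb, b.memoryMb),
                       Math.min(a.vcores, b.vcores));
    }
}

// Hypothetical helper illustrating the sanity check described above.
final class HeadroomSanityCheck {
    // Cluster headroom: total cluster resource minus utilized resources.
    static Res clusterHeadroom(Res clusterTotal, Res clusterUsed) {
        return Res.subtractNonNegative(clusterTotal, clusterUsed);
    }

    // Never report more headroom than the cluster actually has free,
    // whatever the per-queue/per-user calculation produced.
    static Res cappedHeadroom(Res calculated, Res clusterTotal, Res clusterUsed) {
        return Res.componentwiseMin(calculated,
                                    clusterHeadroom(clusterTotal, clusterUsed));
    }
}
```

With a calculated headroom of 8192 MB / 8 vcores on a 16384 MB / 16 vcore cluster that has 14336 MB / 10 vcores in use, the capped value is 2048 MB / 6 vcores. An AM that sees the smaller number knows it cannot wait for containers that the cluster has no room for, which is the deadlock the check is meant to prevent.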

> Capacity Scheduler headroom calculation does not work as expected
> -----------------------------------------------------------------
>
>                 Key: YARN-1198
>                 URL: https://issues.apache.org/jira/browse/YARN-1198
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Omkar Vinit Joshi
>            Assignee: Omkar Vinit Joshi
>
> Today headroom calculation (for the app) takes place only when
> * A node is added to or removed from the cluster
> * A new container is assigned to the application.
> However, there are potentially a lot of situations that this calculation does not consider:
> * If a container finishes, then the headroom for that application changes and the AM should be notified accordingly.
> * If a single user has submitted multiple applications (app1 and app2) to the same queue, then
> ** If app1's container finishes, then not only app1's but also app2's AM should be notified about the change in headroom.
> ** Similarly, if a container is assigned to either application (app1 or app2), then both AMs should be notified about their headroom.
> ** To simplify the whole communication process, it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue).
> * If a new user submits an application to the queue, then all applications submitted by all users in that queue should be notified of the headroom change.
> * Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible).
> * Also, when an admin user refreshes the queue, headroom has to be updated.
> These are all potential bugs in headroom calculation.
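
The "headroom per User per LeafQueue" idea in the quoted description can be sketched roughly as follows. This is only an illustration under stated assumptions: `SharedHeadroomRegistry` and its methods are hypothetical names, headroom is reduced to a single memory figure in MB for brevity, and the string key assumes queue and user names contain no "/".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical registry: one shared headroom holder per (leaf queue, user).
final class SharedHeadroomRegistry {
    // Key is "queue/user"; every app of that user in that queue shares the holder.
    private final Map<String, AtomicLong> headroomMb = new ConcurrentHashMap<>();

    // Returns the shared holder, creating it with zero headroom on first use.
    AtomicLong holderFor(String leafQueue, String user) {
        return headroomMb.computeIfAbsent(leafQueue + "/" + user,
                                          k -> new AtomicLong(0L));
    }

    // A single update is visible to every app holding a reference to the holder.
    void update(String leafQueue, String user, long newHeadroomMb) {
        holderFor(leafQueue, user).set(newHeadroomMb);
    }
}
```

Because app1 and app2 of the same user obtain the same holder, an update triggered by app1's container finishing is what app2's AM reads on its next heartbeat, so both see the same picture without per-application bookkeeping.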



--
This message was sent by Atlassian JIRA
(v6.2#6252)