Posted to yarn-issues@hadoop.apache.org by "Manikandan R (JIRA)" <ji...@apache.org> on 2018/06/15 14:22:04 UTC
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

    [ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513874#comment-16513874 ] 

Manikandan R commented on YARN-4606:
------------------------------------

Thanks [~eepayne] for your reviews. In addition to your review comments, I was also trying to address the "move app" flow, but I got stuck on it and it took more time than expected. Sorry for the delay.

I am stuck on one case: an admin tries to move an app (waiting for its AM container) from Queue A to Queue B. As part of this, control reaches {{AppScheduling#move}} through {{CapacityScheduler#moveApplication}}. As a first step, we need to handle the activeUsersWithOnlyPendingApps count for both queues. For example, after submitting the app to the queue inside {{CapacityScheduler#moveApplication}}, we need to do something like:

{quote}
        // Handle the activeUsersWithOnlyPendingApps count appropriately
        if (app.isPending()) {
          this.getQueue(sourceQueueName).getAbstractUsersManager()
              .decrNumActiveUsersWithOnlyPendingApps(user);
          this.getQueue(destQueueName).getAbstractUsersManager()
              .incrNumActiveUsersWithOnlyPendingApps(user);
        }
{quote}
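
To make the bookkeeping for the move concrete, here is a minimal, self-contained sketch in plain Java. It is not the patch and does not use the real YARN classes: {{QueueUsers}}, {{movePendingApp}} and the maps inside are hypothetical names made up for illustration. It assumes each queue tracks, per user, how many apps are pending and how many are active, and derives the "users with only pending apps" count from those maps rather than incrementing and decrementing it directly.

```java
import java.util.HashMap;
import java.util.Map;

class MovePendingAppSketch {

    /** Hypothetical stand-in for a per-queue users manager; NOT the real YARN class. */
    static class QueueUsers {
        // user -> number of that user's apps still waiting for an AM container
        private final Map<String, Integer> pendingApps = new HashMap<>();
        // user -> number of that user's apps that are already running
        private final Map<String, Integer> activeApps = new HashMap<>();

        void addPendingApp(String user) {
            pendingApps.merge(user, 1, Integer::sum);
        }

        void addActiveApp(String user) {
            activeApps.merge(user, 1, Integer::sum);
        }

        void removePendingApp(String user) {
            // Returning null from the remapping function drops the entry at zero.
            pendingApps.computeIfPresent(user, (u, n) -> n > 1 ? n - 1 : null);
        }

        /** Users in this queue who have pending apps but no active app. */
        int numUsersWithOnlyPendingApps() {
            int count = 0;
            for (String user : pendingApps.keySet()) {
                if (!activeApps.containsKey(user)) {
                    count++;
                }
            }
            return count;
        }
    }

    /** Moves one pending app of the given user from the source to the destination queue. */
    static void movePendingApp(QueueUsers source, QueueUsers dest, String user) {
        source.removePendingApp(user);
        dest.addPendingApp(user);
    }

    public static void main(String[] args) {
        QueueUsers queueA = new QueueUsers();
        QueueUsers queueB = new QueueUsers();
        queueA.addPendingApp("user3"); // app waiting for its AM container in Queue A
        movePendingApp(queueA, queueB, "user3");
        System.out.println("A=" + queueA.numUsersWithOnlyPendingApps()
            + " B=" + queueB.numUsersWithOnlyPendingApps()); // prints "A=0 B=1"
    }
}
```

Deriving the count in this sketch covers the case where the user still has another pending app in the source queue, or already has an active app in the destination queue. Whether plain incr/decr calls cover those cases depends on how the patch maintains the counter.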

Then, inside {{AppScheduling#move}}, we need to follow logic similar to the changes in {{AppScheduling#updatePendingResources}} to call {{UsersManager#activateApplications}}. The call to {{AppScheduling#updatePendingResources}} happens periodically as part of the allocate flow, but there are no such periodic calls for move app. In the normal app flow, waitingForAMContainer becomes false for a given app at some point, the call to {{UsersManager#activateApplications}} happens, and the user gets activated. We need to handle the same in the move app flow. I was thinking of waiting for some duration (possibly based on the average AM container allocation time?) so that the AM container is likely to have been allocated by then, but I am not sure. The attached patch contains this change as well. Please advise.

Now, coming back to review comments:

1. Yes, it is scheduler specific. [~leftnoteasy] and [~sunilg], please share your views.
2. For the first cut, I was thinking of fixing this JIRA for CS end to end. Once the fix is confirmed for CS, we can apply similar changes to FS as well, either in this JIRA or in a different one. If we address the FS-related changes in a different JIRA, is it OK to carry the risk you mentioned earlier? Please advise. I can either take help from folks who are familiar with the FS flow or hand it over to them. Either way is fine.
3. Addressed.

> CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps 
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch, YARN-4606.003.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are pending (caused by max-am-percent, etc.), ActiveUsersManager still considers the user an active user. This could lead to starvation of active applications, for example:
> - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to user3)/app4(belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, so the computed user-limit-resource could be lower than expected.
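
The effect in the example from the description can be illustrated with a deliberately simplified calculation. This sketch assumes the user limit is just the queue's resource divided evenly among the users counted as active. The real CapacityScheduler computation involves more inputs, so {{userLimit}} and the value 100 here are hypothetical and only show the direction of the error.

```java
class UserLimitSketch {

    /**
     * Simplified assumption: each user counted as active gets an equal share.
     * This is NOT the actual CapacityScheduler formula.
     */
    static int userLimit(int queueResource, int activeUsers) {
        if (activeUsers <= 0) {
            throw new IllegalArgumentException("activeUsers must be positive");
        }
        return queueResource / activeUsers;
    }

    public static void main(String[] args) {
        int queueResource = 100; // hypothetical queue size, in arbitrary units

        // user3 and user4 only have pending apps but are still counted as active
        System.out.println(userLimit(queueResource, 4)); // prints 25

        // only user1 and user2 can actually allocate resources
        System.out.println(userLimit(queueResource, 2)); // prints 50
    }
}
```

Under this assumption, user1 and user2 are each capped at 25 units instead of 50, and half of the queue stays idle even though both have demand.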



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
