Posted to yarn-issues@hadoop.apache.org by "Sunil G (JIRA)" <ji...@apache.org> on 2016/10/25 04:46:58 UTC

[jira] [Comment Edited] (YARN-5773) RM recovery too slow due to LeafQueue#activateApplication()

    [ https://issues.apache.org/jira/browse/YARN-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604144#comment-15604144 ] 

Sunil G edited comment on YARN-5773 at 10/25/16 4:46 AM:
---------------------------------------------------------

*Issues in Recovery of apps:*
1. activateApplications works under a write lock.
2. If one application is found to exceed the AM resource limit, we do not break out of the loop; we continue and go through all the remaining apps in pendingOrderingPolicy. We may need to iterate over all apps because apps can belong to different partitions, and pendingOrderingPolicy does not order apps by partition.
3. As mentioned by [~bibinchundatt], each time an app fails to get activated because it hits the upper limit on AM resources, one INFO log is emitted (because *amLimit* is 0). During recovery, this is costly.
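
To make points 2 and 3 concrete, here is a minimal sketch of such a loop. It is a simplified model and not the actual LeafQueue code: the class {{ActivationLoopSketch}}, its fields, and the use of plain integers for resources are all assumptions made for illustration, and the write lock from point 1 is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the activation loop; not the real LeafQueue.
class ActivationLoopSketch {

    static final class PendingApp {
        final String partition;
        final int amResource; // AM resource as a plain int (assumption)

        PendingApp(String partition, int amResource) {
            this.partition = partition;
            this.amResource = amResource;
        }
    }

    static final class Result {
        final int visited;   // apps examined in this pass
        final int activated; // apps moved out of the pending list
        final int infoLogs;  // stand-in for "not activating" INFO lines

        Result(int visited, int activated, int infoLogs) {
            this.visited = visited;
            this.activated = activated;
            this.infoLogs = infoLogs;
        }
    }

    // One pass over the pending apps with a per-partition AM limit.
    static Result activate(List<PendingApp> pending, Map<String, Integer> amLimit) {
        Map<String, Integer> amUsed = new HashMap<>();
        int visited = 0, activated = 0, infoLogs = 0;
        Iterator<PendingApp> it = pending.iterator();
        while (it.hasNext()) {
            PendingApp app = it.next();
            visited++;
            int limit = amLimit.getOrDefault(app.partition, 0);
            int used = amUsed.getOrDefault(app.partition, 0);
            if (used + app.amResource > limit) {
                infoLogs++; // one log line per app that cannot be activated
                continue;   // not break: a later app may be in a partition with room
            }
            amUsed.put(app.partition, used + app.amResource);
            it.remove();
            activated++;
        }
        return new Result(visited, activated, infoLogs);
    }

    // Mixed partitions: the first app does not fit, later ones still do.
    static Result mixedPartitions() {
        List<PendingApp> pending = new ArrayList<>(Arrays.asList(
                new PendingApp("x", 4), new PendingApp("", 2), new PendingApp("x", 1)));
        Map<String, Integer> limits = new HashMap<>();
        limits.put("", 2);
        limits.put("x", 1);
        return activate(pending, limits);
    }

    // Recovery before any node registers: every limit is 0, nothing activates.
    static Result zeroLimits(int n) {
        List<PendingApp> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pending.add(new PendingApp("", 1));
        }
        return activate(pending, new HashMap<>());
    }

    public static void main(String[] args) {
        Result r = zeroLimits(1000);
        System.out.println(r.visited + " visited, " + r.activated
                + " activated, " + r.infoLogs + " log lines");
    }
}
```

In this model, a pass with all limits at 0 visits every pending app, activates none, and produces one log line per app, which is the pattern described in point 3.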

[~leftnoteasy] and [~rohithsharma]
bq.If a given app's AM resource amount > AM headroom, should we skip the AM and activate following app which AM resource amount <= AM headroom?
bq.But one point to be considered is for each Node registration, head room changes. So, user head room changes as new node registered. This need to be taken care.
Currently activateApplications is invoked whenever the cluster resource changes, so any such change triggers a call to activateApplications and we can recalculate this headroom there. I am not very sure about the suggested map. Will this check come before the existing AM resource percentage check for the queue/partition (not user based), or are we replacing those checks?


> RM recovery too slow due to LeafQueue#activateApplication()
> -----------------------------------------------------------
>
>                 Key: YARN-5773
>                 URL: https://issues.apache.org/jira/browse/YARN-5773
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-5773.0001.patch, YARN-5773.0002.patch
>
>
> # Submit 10K applications to the default queue.
> # All applications are in the accepted state.
> # Now restart the ResourceManager.
> For each application recovered, {{LeafQueue#activateApplications()}} is invoked, so the AM limit check is done even before the NodeManagers have registered.
> The total number of iterations for N applications is about {{N(N+1)/2}}; for {{10K}} applications that is about {{50000000}} iterations, causing the RM to take more than 10 min to become active.
> Since NM resources are not yet added during recovery, we should skip {{activateApplications()}}.
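
The iteration count in the description above can be checked with a small sketch. It assumes a simplified model in which each recovered app is appended to a pending list and the whole list is then scanned once, with nothing activated because no node has registered yet. The class and method names are hypothetical and are not part of YARN.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the recovery cost; not the actual LeafQueue code.
class RecoveryIterationCount {

    // Closed form from the description: N(N+1)/2.
    static long closedForm(long n) {
        return n * (n + 1) / 2;
    }

    // Recover apps one at a time; after each one, scan every pending app.
    // Nothing leaves the pending list, so scan k touches k apps.
    static long simulate(int n) {
        List<Integer> pending = new ArrayList<>();
        long iterations = 0;
        for (int app = 0; app < n; app++) {
            pending.add(app);
            for (Integer ignored : pending) {
                iterations++;
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(closedForm(10_000));
        System.out.println(simulate(10_000));
    }
}
```

For N = 10,000 the closed form gives 50,005,000, which the description rounds to 50000000.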



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
