Posted to yarn-issues@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2019/05/06 07:07:00 UTC

[jira] [Commented] (YARN-9432) Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled

    [ https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833551#comment-16833551 ] 

Tao Yang commented on YARN-9432:
--------------------------------

Thanks [~cheersyang] for the review. 

The problem is that async-scheduling with multi-nodes enabled skips allocation when usedCapacity reaches 1.0 (per CapacityScheduler#allocateContainersOnMultiNodes). Since excess reservations may be counted as part of usedCapacity, they can break the normal scheduling process.
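As a rough illustration (not the actual Hadoop code; the Queue class and its fields here are simplified stand-ins), the skip condition described above amounts to the following: reservations inflate used capacity, so an excess reservation can push a queue to 1.0 and keep the multi-node path from ever running:

```java
// Simplified sketch of the multi-node allocation guard described above.
// NOT actual Hadoop code: Queue, allocated and reserved are hypothetical stand-ins.
public class MultiNodeSkipSketch {
    static class Queue {
        double allocated;   // fraction of cluster actually running containers
        double reserved;    // fraction held by (possibly excess) reservations

        Queue(double allocated, double reserved) {
            this.allocated = allocated;
            this.reserved = reserved;
        }

        // Reservations count toward used capacity, so an excess
        // reservation can push usedCapacity up to 1.0.
        double usedCapacity() {
            return allocated + reserved;
        }
    }

    // Mirrors the guard described for CapacityScheduler#allocateContainersOnMultiNodes:
    // allocation is skipped once the queue looks full.
    static boolean shouldAttemptMultiNodeAllocation(Queue q) {
        return q.usedCapacity() < 1.0;
    }

    public static void main(String[] args) {
        Queue healthy = new Queue(0.6, 0.2);
        Queue leaked  = new Queue(0.6, 0.4); // excess reservation fills the queue

        System.out.println(shouldAttemptMultiNodeAllocation(healthy)); // true
        System.out.println(shouldAttemptMultiNodeAllocation(leaked));  // false: scheduling stuck
    }
}
```

The leaked queue never gets another allocation pass, which is exactly why the excess reservation can persist indefinitely.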

A simple solution is to trigger the unreserve process in this scenario to release excess reserved containers promptly. The unreserve logic in CapacityScheduler#allocateContainerOnSingleNode can be extracted into a standalone method so that it can be reused in CapacityScheduler#allocateContainersOnMultiNodes.
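The extraction described above can be sketched like this (hypothetical simplified types, not the real CapacityScheduler signatures): the unreserve step becomes one shared helper that both allocation paths call before attempting allocation:

```java
// Hypothetical sketch of extracting the unreserve step into a shared helper.
// Reservation and its fields are illustrative stand-ins, not Hadoop types.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class UnreserveSketch {
    static class Reservation {
        final String containerId;
        boolean requestStillPending; // false once cancelled or satisfied

        Reservation(String containerId, boolean requestStillPending) {
            this.containerId = containerId;
            this.requestStillPending = requestStillPending;
        }
    }

    // Extracted helper: releases reservations whose request has been
    // cancelled or satisfied. Returns the ids that were unreserved.
    static List<String> unreserveExcess(List<Reservation> reservations) {
        List<String> released = new ArrayList<>();
        Iterator<Reservation> it = reservations.iterator();
        while (it.hasNext()) {
            Reservation r = it.next();
            if (!r.requestStillPending) {
                it.remove();
                released.add(r.containerId);
            }
        }
        return released;
    }

    // Both allocation paths reuse the same helper instead of the single-node
    // path owning the logic exclusively.
    static void allocateOnSingleNode(List<Reservation> rs) { unreserveExcess(rs); /* ...allocate... */ }
    static void allocateOnMultiNodes(List<Reservation> rs) { unreserveExcess(rs); /* ...allocate... */ }

    public static void main(String[] args) {
        List<Reservation> rs = new ArrayList<>();
        rs.add(new Reservation("c1", true));
        rs.add(new Reservation("c2", false)); // excess: its request was satisfied
        allocateOnMultiNodes(rs);
        System.out.println(rs.size()); // 1: the excess reservation c2 was released
    }
}
```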

Attached v3 patch for review.

> Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9432
>                 URL: https://issues.apache.org/jira/browse/YARN-9432
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9432.001.patch, YARN-9432.002.patch, YARN-9432.003.patch
>
>
> Reserved containers may become excess after their requests have been cancelled or satisfied; such excess reserved containers need to be unreserved quickly to release resources for others.
> In the multi-nodes disabled scenario, excess reserved containers can be quickly released on the next node heartbeat; the calling stack is CapacityScheduler#nodeUpdate --> CapacityScheduler#allocateContainersToNode --> CapacityScheduler#allocateContainerOnSingleNode.
> But in the multi-nodes enabled scenario, excess reserved containers have a chance to be released only during the allocation process; the key phase of the calling stack is LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer. As a result, excess reserved containers may not be released until their queue has a pending request and gets a chance to allocate, and in the worst case, excess reserved containers will never be released and will keep holding resources if there is no further pending request for that queue.
> To solve this problem, my proposal is to directly kill excess reserved containers when the request is satisfied (in FiCaSchedulerApp#apply) or when the allocation number of resource-requests/scheduling-requests is updated to 0 (in SchedulerApplicationAttempt#updateResourceRequests / SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.
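The eager-kill idea in the quoted description can be sketched as follows (hypothetical simplified bookkeeping, not the real SchedulerApplicationAttempt API): when an update zeroes out a request's outstanding count, the reservation held for it is released immediately instead of waiting for a later allocation pass:

```java
// Hypothetical sketch of the proposed fix: kill an excess reservation as
// soon as its request's pending count drops to zero. The maps below are
// illustrative stand-ins, not Hadoop's actual bookkeeping structures.
import java.util.HashMap;
import java.util.Map;

public class KillExcessSketch {
    final Map<String, Integer> pending = new HashMap<>();    // requestKey -> outstanding count
    final Map<String, String> reservedFor = new HashMap<>(); // requestKey -> reserved containerId

    void reserve(String requestKey, String containerId) {
        reservedFor.put(requestKey, containerId);
    }

    // Analogous in spirit to SchedulerApplicationAttempt#updateResourceRequests:
    // when an update zeroes out a request, release its reservation eagerly.
    // Returns the container id to kill, or null if nothing was reserved.
    String updatePending(String requestKey, int newCount) {
        pending.put(requestKey, newCount);
        if (newCount == 0) {
            return reservedFor.remove(requestKey);
        }
        return null;
    }

    public static void main(String[] args) {
        KillExcessSketch app = new KillExcessSketch();
        app.pending.put("req1", 3);
        app.reserve("req1", "container_01");

        // Request cancelled: count goes to 0, reservation is killed right away.
        String toKill = app.updatePending("req1", 0);
        System.out.println(toKill); // container_01
    }
}
```

The point of the design is that the release no longer depends on the queue receiving further pending requests, which closes the leak in the multi-nodes enabled path.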



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
