Posted to yarn-issues@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2018/07/19 09:39:00 UTC

[jira] [Commented] (YARN-8546) A reserved container might be released multiple times under async scheduling

    [ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549058#comment-16549058 ] 

Tao Yang commented on YARN-8546:
--------------------------------

Thanks [~cheersyang] for catching this problem.
The detailed process is as follows:
(1) The app reserved three containers (container_A/container_B/container_C), and queue-used is now equal to queue-max-capacity.
(2) Async-scheduling-thread-1 allocated container_X and picked container_A on node1 as the to-be-released container; async-scheduling-thread-2 allocated container_Y on node2 and also picked container_A as the to-be-released container.
(3) The container_X proposal was accepted. Queue-used is still equal to queue-max-capacity because the allocated resource equals the released resource.
(4) The container_Y proposal was accepted too. Queue-used now exceeds queue-max-capacity because the scheduler allocated a container but released none (it found container_A already released when it tried to release it again).
(5) After that, the scheduler can't allocate any more containers for this app; every allocation proposal is rejected because queue-used exceeds queue-max-capacity.
The key to this problem is step (4): the scheduler should not accept the container_Y proposal, which tries to release an outdated reserved container.
Attached v1 patch for review.
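
To make steps (2) to (4) concrete, here is a toy model in Java. It is not the attached patch and does not use any CapacityScheduler classes: QueueModel, reserve, commit, and the checkReservation flag are hypothetical names for illustration only. It assumes every container is 100 units and that reserved resources count toward queue-used, as in step (1).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of queue accounting; not real YARN code.
class QueueModel {
    private final int maxCapacity;
    private int used = 0;
    // Reserved container id -> reserved size (counted in "used").
    private final Map<String, Integer> reserved = new HashMap<>();

    QueueModel(int maxCapacity) {
        this.maxCapacity = maxCapacity;
    }

    void reserve(String containerId, int size) {
        reserved.put(containerId, size);
        used += size;
    }

    // Commit a proposal that allocates allocSize and releases the reserved
    // container toRelease. With checkReservation == false, a missing
    // reservation is silently ignored (the behavior described in step 4).
    // With checkReservation == true, such a proposal is rejected.
    boolean commit(String toRelease, int allocSize, boolean checkReservation) {
        Integer releasedSize = reserved.get(toRelease);
        if (releasedSize == null && checkReservation) {
            return false; // outdated reservation: reject the proposal
        }
        if (releasedSize != null) {
            reserved.remove(toRelease);
            used -= releasedSize;
        }
        used += allocSize;
        return true;
    }

    int used() {
        return used;
    }

    boolean overCapacity() {
        return used > maxCapacity;
    }
}
```

With a max capacity of 300 and three 100-unit reservations, committing two proposals that both release container_A without the check leaves used at 400, above the limit. With the check, the second proposal is rejected and used stays at 300.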

> A reserved container might be released multiple times under async scheduling
> ----------------------------------------------------------------------------
>
>                 Key: YARN-8546
>                 URL: https://issues.apache.org/jira/browse/YARN-8546
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler
>    Affects Versions: 3.1.0
>            Reporter: Weiwei Yang
>            Assignee: Tao Yang
>            Priority: Major
>              Labels: global-scheduling
>         Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting containers until it uses up the cluster's available resources. My cluster has 70200 vcores and each task requests 100 vcores, so I expected a total of 702 containers to be allocated, but eventually there were only 701. The last container could not be allocated because the queue's used resource had been updated to more than 100%.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
