Posted to yarn-issues@hadoop.apache.org by "Rick Moritz (JIRA)" <ji...@apache.org> on 2019/08/13 12:39:00 UTC

[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

    [ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906148#comment-16906148 ] 

Rick Moritz commented on YARN-1902:
-----------------------------------

This bug can also cause application crashes if the application handles "ContainerAllocated" events by stockpiling the allocated containers and then scheduling tasks onto them as the tasks arrive. The tokens of the stockpiled containers usually time out in the meantime, which leads to some very interesting guesswork about why the program logic is attempting to launch containers whose tokens are already obsolete.
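
To illustrate the alternative to stockpiling, here is a minimal sketch of splitting each allocation batch into the containers that are actually needed and a surplus to hand back right away. AllocationSplitter, Split and the pendingTasks parameter are hypothetical names made up for this example; they are not part of the YARN API, and the containers are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the YARN API: splits one batch of
// allocated containers into those still needed and the surplus.
final class AllocationSplitter {

    /** Result of splitting one allocation batch. */
    static final class Split<C> {
        final List<C> toLaunch = new ArrayList<>();
        final List<C> toRelease = new ArrayList<>();
    }

    /**
     * Keeps at most pendingTasks containers for launching and marks the
     * rest for release, instead of holding on to them until their
     * tokens expire.
     */
    static <C> Split<C> split(List<C> allocated, int pendingTasks) {
        Split<C> result = new Split<>();
        for (C container : allocated) {
            if (result.toLaunch.size() < pendingTasks) {
                result.toLaunch.add(container);
            } else {
                result.toRelease.add(container);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Four containers arrive although only one task is waiting.
        Split<String> s = split(Arrays.asList("c1", "c2", "c3", "c4"), 1);
        System.out.println("launch=" + s.toLaunch + " release=" + s.toRelease);
        // prints: launch=[c1] release=[c2, c3, c4]
    }
}
```

In a real application master the entries of toRelease would presumably be handed back via AMRMClient#releaseAssignedContainer; that wiring is left out here because it depends on how the application tracks its pending tasks.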

I also wonder how this mixes with the recent addition of "opportunistic allocation".

Hadoop 3 would have been a great opportunity to close this bug :(

> Allocation of too many containers when a second request is done with the same resource capability
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1902
>                 URL: https://issues.apache.org/jira/browse/YARN-1902
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Sietse T. Au
>            Assignee: Sietse T. Au
>            Priority: Major
>              Labels: client
>         Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y: addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started. If addContainerRequest is then called once more, followed by an allocate call to the RM, (z+1) containers are allocated where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing the debug logs of AMRMClientImpl, I found that (z+1) containers are indeed requested in both scenarios, but that the correct behavior is observed only in the second scenario.
> Looking at the implementation, I found that this (z+1) request is caused by the structure of the remoteRequestsTable. A consequence of using Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo holds no information about whether a request has already been sent to the RM.
> There are workarounds for this, such as releasing the excess containers received.
> The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.
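
The counting problem in scenario 1 can be sketched with a toy model. This is not the AMRMClientImpl code: RequestTableModel and ToyClient are hypothetical names, the capability is a plain string, and the model assumes that the count kept per capability is cumulative and that the RM grants exactly the number it is asked for. The resetAfterSend flag imitates the idea of starting a fresh request once the previous one has been sent.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and do not reproduce
// AMRMClientImpl. The "RM" here grants every container it is asked for.
final class RequestTableModel {

    /** Client-side table keyed by capability, holding the count to ask for. */
    static final class ToyClient {
        private final Map<String, Integer> table = new HashMap<>();
        private final boolean resetAfterSend;

        ToyClient(boolean resetAfterSend) {
            this.resetAfterSend = resetAfterSend;
        }

        void addContainerRequest(String capability) {
            table.merge(capability, 1, Integer::sum);
        }

        /** Sends the current count and returns what the toy RM grants. */
        int allocate(String capability) {
            int ask = table.getOrDefault(capability, 0);
            if (resetAfterSend) {
                // Start a fresh count once the ask has been sent.
                table.remove(capability);
            }
            return ask; // assumption: the toy RM grants exactly the ask
        }
    }

    /** Scenario 1: z requests, allocate, one more request, allocate again. */
    static int containersFromSecondAllocate(boolean resetAfterSend, int z) {
        ToyClient client = new ToyClient(resetAfterSend);
        for (int i = 0; i < z; i++) {
            client.addContainerRequest("y");
        }
        client.allocate("y");            // first allocate: z containers
        client.addContainerRequest("y"); // one more request
        return client.allocate("y");     // second allocate
    }

    public static void main(String[] args) {
        System.out.println(containersFromSecondAllocate(false, 3)); // prints 4
        System.out.println(containersFromSecondAllocate(true, 3));  // prints 1
    }
}
```

The model deliberately ignores everything else the real client does (priorities, locations, removal of satisfied requests), so it only shows why a table that cannot tell sent requests from unsent ones ends up asking for (z+1) instead of 1.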


