Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (JIRA)" <ji...@apache.org> on 2016/05/11 23:54:13 UTC

[jira] [Commented] (YARN-5074) RM cycles through container ids for an app that is waiting for resources.

    [ https://issues.apache.org/jira/browse/YARN-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281017#comment-15281017 ] 

Wangda Tan commented on YARN-5074:
----------------------------------

Thanks [~sidharta-s] for reporting this issue.

This happens when:

- The cluster has multiple nodes (>= 2)
- App1 takes almost all of the cluster's resources
- The AM request of app2 can be reserved but cannot be allocated
- App2 gets resources from a node other than the reserved node (in other words, a reservation cancellation happens); app2 can then get a container id with number > 1

From what I can see, there are two issues that can cause container ids to be skipped when working with reservation-continuous-looking:

*Issue#1, multiple container ids will be skipped*
In LeafQueue#assignContainer:
{code}
    // Create the container if necessary
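    // (Per the description above: in branch-2.7 this call runs before the
    // allocate/reserve decision, so it can consume a container id for an
    // attempt that ultimately fails.)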
    Container container = 
        getContainer(rmContainer, application, node, capability, priority);
{code}

This call happens before a container is successfully allocated or reserved.

So when LeafQueue's relaxed checks take reserved resources into account, unnecessary getContainer calls can happen, each of which consumes a container id.

This issue only exists in branch-2.7. Branch-2.8/branch-2/trunk will not create a container id unless it allocates or reserves a new container.
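
To illustrate the ordering difference, here is a minimal standalone sketch (hypothetical class and method names, not the actual scheduler code), assuming container ids come from a monotonically increasing per-attempt counter:

{code}
import java.util.concurrent.atomic.AtomicLong;

public class ContainerIdOrderingSketch {
  // Hypothetical stand-in for the per-attempt container id counter.
  private static final AtomicLong nextId = new AtomicLong(1);

  // branch-2.7 ordering: the id is drawn before we know whether the
  // allocate/reserve attempt succeeds, so every failed attempt burns an id.
  static long assignEagerly(boolean succeeds) {
    long id = nextId.getAndIncrement();
    return succeeds ? id : -1;
  }

  // branch-2.8/branch-2/trunk ordering: the id is drawn only once the
  // attempt is known to allocate or reserve a container.
  static long assignLazily(boolean succeeds) {
    return succeeds ? nextId.getAndIncrement() : -1;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 3; i++) {
      assignEagerly(false); // three failed attempts, three ids burned
    }
    System.out.println("eager id: " + assignEagerly(true)); // prints 4

    nextId.set(1);
    for (int i = 0; i < 3; i++) {
      assignLazily(false); // failed attempts consume nothing
    }
    System.out.println("lazy id: " + assignLazily(true)); // prints 1
  }
}
{code}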

*Issue#2, a single container id will be skipped:*
This issue exists in both branch-2.7 and branch-2.8+.

When one container (c1) is reserved at host1 and the reservation is later cancelled in order to allocate another container (c2) at a different host, the container id of c1 will be skipped.
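
A minimal sketch of this sequence (again with hypothetical names, only to show why exactly one id is skipped):

{code}
import java.util.concurrent.atomic.AtomicLong;

public class ReservationCancelSketch {
  // Hypothetical per-attempt container id counter.
  private static final AtomicLong nextId = new AtomicLong(1);

  public static void main(String[] args) {
    // c1 is reserved at host1; drawing its id consumes one value.
    long c1 = nextId.getAndIncrement();
    System.out.println("c1 reserved at host1, id = " + c1); // 1

    // The reservation is cancelled so the request can be satisfied at a
    // different host; c1 never becomes a live container and its id is
    // never reused.
    long c2 = nextId.getAndIncrement();
    System.out.println("c2 allocated at host2, id = " + c2); // 2 (id 1 skipped)
  }
}
{code}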

Uploading a demo test to reproduce this issue in branch-2.7:

> RM cycles through container ids for an app that is waiting for resources. 
> --------------------------------------------------------------------------
>
>                 Key: YARN-5074
>                 URL: https://issues.apache.org/jira/browse/YARN-5074
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2
>            Reporter: Sidharta Seethana
>         Attachments: YARN-5074-test-case.patch
>
>
> /cc [~wangda], [~vinodkv]
> This was observed on a cluster running a 2.7.x build. Here is the scenario :
> 1. A YARN cluster has applications running that almost entirely consume the cluster, leaving few resources available.
> 2. A new app is submitted - the resources required for its AM exceed what is available in the cluster. The app stays in the 'ACCEPTED' state until resources are available.
> 3. Once resources are available and the AM container comes up, the AM container has an id that indicates that the RM has been cycling through containers. There are no errors of any kind in the logs. One example id for such an AM container is: container_e3788_1462916288781_0012_01_000302. This indicates that while the app was in the 'ACCEPTED' state, the RM cycled through 301 containers.


