Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/02/04 20:01:34 UTC
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

    [ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305740#comment-14305740 ] 

Jason Lowe commented on YARN-3136:
----------------------------------

Sample stacktrace:

{noformat}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getTransferredContainers(AbstractYarnScheduler.java:86)
        - waiting to lock <0x000000023ae075e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:297)
        - locked <0x00000004cda13f98> (a org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:90)
        at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
{noformat}

From a cursory glance, it looks like the applications map is always a ConcurrentHashMap in practice, and I think we might be able to find a way to either remove the lock entirely or at least lock something at a more granular level than the big scheduler lock.
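
To illustrate the idea, here is a minimal sketch, not the actual YARN code. It assumes the per-application lookup only needs the thread-safety that ConcurrentHashMap already provides, so the read path can skip the scheduler monitor. The names SchedulerSketch, AppRecord, and addApplication are hypothetical stand-ins, and container IDs are plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a scheduler; these are not the real YARN classes.
public class SchedulerSketch {

    // Hypothetical per-application record with an immutable snapshot of
    // the containers carried over from a previous attempt.
    static final class AppRecord {
        private final List<String> transferred;

        AppRecord(List<String> transferred) {
            this.transferred =
                Collections.unmodifiableList(new ArrayList<>(transferred));
        }

        List<String> transferredContainers() {
            return transferred;
        }
    }

    // Assumption: the applications map is always a concurrent map.
    private final ConcurrentMap<String, AppRecord> applications =
        new ConcurrentHashMap<>();

    // Writes still take the coarse scheduler lock (the object monitor).
    public synchronized void addApplication(String appId, List<String> containers) {
        applications.put(appId, new AppRecord(containers));
    }

    // Read path: deliberately NOT synchronized. The concurrent map makes the
    // lookup safe, and the returned list is immutable.
    public List<String> getTransferredContainers(String appId) {
        AppRecord app = applications.get(appId);
        return app == null
            ? Collections.<String>emptyList()
            : app.transferredContainers();
    }

    public static void main(String[] args) throws InterruptedException {
        SchedulerSketch scheduler = new SchedulerSketch();
        scheduler.addApplication("app_1", Arrays.asList("container_1", "container_2"));

        // Hold the scheduler monitor, as a busy scheduling pass would.
        synchronized (scheduler) {
            Thread reader = new Thread(() ->
                System.out.println(scheduler.getTransferredContainers("app_1")));
            reader.start();
            // Completes while the monitor is held, because the read path
            // never tries to acquire it.
            reader.join();
        }
    }
}
```

If getTransferredContainers were declared synchronized, the reader thread in main would block until the monitor was released, which is the behavior shown in the stack trace above. Whether the real method can drop the lock depends on what else it touches besides the map, which this sketch does not model.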

> getTransferredContainers can be a bottleneck during AM registration
> -------------------------------------------------------------------
>
>                 Key: YARN-3136
>                 URL: https://issues.apache.org/jira/browse/YARN-3136
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers.  The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call.  We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible.


