You are viewing a plain text version of this content. The canonical link for it is here.

Posted to yarn-issues@hadoop.apache.org by "Naganarasimha G R (JIRA)" <ji...@apache.org> on 2016/12/27 19:00:59 UTC

[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container

    [ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781054#comment-15781054 ] 

Naganarasimha G R commented on YARN-6029:
-----------------------------------------

Thanks for working on the patch [~Tao Yang], Actually whats happening is inversion of the lock order, in one flow we are holding the lock of the Leaf and trying to get the lock of the parent (completedContainer flow) and in the other we are holding the lock of the Parent and then trying to get the lock of the Leaf (getQueueUserAclInfo flow) . Better solution to this would be to have as per the 2.9/trunk where in read locks are introduced such that *getQueueUserAclInfo* uses read lock and *completedContainer* uses write lock. so that we can avoid the this inversion. This would be a big change but i am not completely sure about your fix as you are only removing from the LeafQueue

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Blocker
>         Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls YarnClient#getQueueAclsInfo) just at the moment that LeafQueue#assignContainers is called and before notifying parent queue to release resource (should release a reserved container), then ResourceManager can deadlock. I found this problem on our testing environment for hadoop2.8.
> Reproduce the deadlock in chronological order
> * 1. Thread A (ResourceManager Event Processor) calls synchronized LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue root), iterates over children queue acls and is blocked when calling synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of queue root.a is hold by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being completed and is blocked when invoking synchronized ParentQueue#internalReleaseResource method (the ParentQueue instance lock of queue root is hold by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be removed to solve this problem, since this method appears to not affect fields of LeafQueue instance.
> Attach patch with UT for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org