Posted to yarn-issues@hadoop.apache.org by "Jian He (JIRA)" <ji...@apache.org> on 2016/03/11 21:23:55 UTC

[jira] [Commented] (YARN-4794) Distributed shell app gets stuck on stopping containers after App completes

    [ https://issues.apache.org/jira/browse/YARN-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191453#comment-15191453 ] 

Jian He commented on YARN-4794:
-------------------------------

There is a deadlock in NMClient.
In NMClientImpl#startContainer, it first grabs the "startingContainer" lock and, if an exception happens, calls removeStartedContainer, which grabs the "NMClient" lock.
On the other hand, if NMClient is stopping in the meantime, it first calls cleanupRunningContainers, which takes the "NMClient" lock, and then calls stopContainer, which takes the "container" lock. The two paths can therefore acquire the locks in opposite order, hence the deadlock.
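
To illustrate the lock-ordering problem described above, here is a minimal, self-contained sketch. It is not the NMClientImpl code and not necessarily the fix that will go into YARN-4794: the class LockOrderSketch, its two lock objects, and the methods failStart, cleanup and runBothPaths are hypothetical stand-ins. It only shows one common remedy, which is to have both paths take the client-level lock before the per-container lock.

```java
// Hypothetical sketch: none of these names come from NMClientImpl.
final class LockOrderSketch {
    // Stand-in for the client-wide monitor (the "NMClient" lock).
    private final Object clientLock = new Object();
    // Stand-in for the per-container monitor.
    private final Object containerLock = new Object();

    private int removed = 0; // guarded by clientLock
    private int stopped = 0; // guarded by clientLock

    // Start path that hits an error and removes the container.
    // The deadlock-prone variant would take containerLock first and
    // clientLock second; here the order is client, then container.
    void failStart() {
        synchronized (clientLock) {
            synchronized (containerLock) {
                removed++;
            }
        }
    }

    // Stop path: client lock first, then the container lock.
    void cleanup() {
        synchronized (clientLock) {
            synchronized (containerLock) {
                stopped++;
            }
        }
    }

    // Runs both paths concurrently and reports whether both threads
    // finished within the timeout with every iteration accounted for.
    static boolean runBothPaths(int iterations) {
        LockOrderSketch s = new LockOrderSketch();
        Thread starter = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                s.failStart();
            }
        });
        Thread stopper = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                s.cleanup();
            }
        });
        // Daemon threads, so a hang would not keep the JVM alive.
        starter.setDaemon(true);
        stopper.setDaemon(true);
        starter.start();
        stopper.start();
        try {
            starter.join(5000);
            stopper.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !starter.isAlive() && !stopper.isAlive()
                && s.removed == iterations && s.stopped == iterations;
    }

    public static void main(String[] args) {
        System.out.println("both paths finished: " + runBothPaths(1000));
    }
}
```

Because every thread acquires the two monitors in the same order, no thread can hold one lock while waiting for a thread that holds the other. Swapping the two synchronized blocks in failStart would recreate the opposite-order pattern from the comment.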


> Distributed shell app gets stuck on stopping containers after App completes
> ---------------------------------------------------------------------------
>
>                 Key: YARN-4794
>                 URL: https://issues.apache.org/jira/browse/YARN-4794
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Sumana Sathish
>            Assignee: Jian He
>            Priority: Critical
>
> Distributed shell app gets stuck on stopping containers after App completes with the following exception
> {code:title = app log}
> 15/12/10 14:52:20 INFO distributedshell.ApplicationMaster: Application completed. Stopping running containers
> 15/12/10 14:52:20 WARN ipc.Client: Exception encountered while connecting to the server : java.nio.channels.ClosedByInterruptException
> {code}


