Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/01/06 17:11:39 UTC

[jira] [Commented] (YARN-4549) Containers stuck in KILLING state

    [ https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085737#comment-15085737 ] 

Jason Lowe commented on YARN-4549:
----------------------------------

Did the kill occur shortly after the container was started?  I'm wondering if the pid file somehow appeared _after_ the attempt to kill.  What does {{ls -l --full-time}} show for the pid file, and how does that correlate to the timestamps in the NM log?  Also just to verify it's in the right place, where is the pid file located relative to the yarn local directory root?
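
As a rough sketch of that timestamp correlation (not part of YARN; the helper names are hypothetical), the following assumes the NM log lines start with a yy/MM/dd HH:mm:ss prefix, as in the excerpt quoted in the issue description, and that the log and the file system use the same local time zone. The log only has one-second resolution, so a difference of under a second is inconclusive.

```python
import os
from datetime import datetime

# Assumed prefix format of NM log lines, e.g. "16/01/05 05:16:44 INFO ..."
NM_LOG_TS_FORMAT = "%y/%m/%d %H:%M:%S"
NM_LOG_TS_LENGTH = 17  # length of "yy/MM/dd HH:mm:ss"


def parse_nm_log_time(line):
    """Parse the leading timestamp of an NM log line (hypothetical helper)."""
    return datetime.strptime(line[:NM_LOG_TS_LENGTH], NM_LOG_TS_FORMAT)


def pid_file_mtime(path):
    """Return the pid file's modification time as a naive local datetime."""
    return datetime.fromtimestamp(os.stat(path).st_mtime)


def pid_file_lag_seconds(mtime, kill_log_line):
    """Seconds between the logged kill attempt and the pid file's mtime.

    A positive value means the pid file was last written after the kill
    was logged; a negative value means it already existed at that point.
    """
    return (mtime - parse_nm_log_time(kill_log_line)).total_seconds()
```

For example, calling pid_file_lag_seconds(pid_file_mtime(path), line) with the "transitioned from RUNNING to KILLING" log line would show whether the pid file appeared before or after the kill attempt.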

You mentioned NM recovery is enabled.  Does this only occur on containers that were recovered on NM startup or also for containers that are started and killed within the same NM session?


> Containers stuck in KILLING state
> ---------------------------------
>
>                 Key: YARN-4549
>                 URL: https://issues.apache.org/jira/browse/YARN-4549
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
>
> We are running samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the container-executor with cgroups configuration. Also we have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the NM tries to kill them. The containers remain running indefinitely, which causes some duplication as new containers are brought up to replace them. Looking through the logs, the NM can't seem to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping container with container Id: container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user IP=10.51.111.243        OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1448454866800_0023    CONTAINERID=container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container container_1448454866800_0023_01_000005
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for container_1448454866800_0023_01_000005. Waited for 2000 ms.
> {noformat}
> The PID files for each container seem to be present on the node. We weren't able to consistently replicate this and are hoping that someone has come across this before.


