Posted to yarn-issues@hadoop.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2019/04/19 20:46:00 UTC

[jira] [Comment Edited] (YARN-9486) Docker container exited with failure does not get clean up correctly

    [ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822175#comment-16822175 ] 

Eric Yang edited comment on YARN-9486 at 4/19/19 8:45 PM:
----------------------------------------------------------

[~Jim_Brennan] containerAlreadyLaunched defaults to false.  If the container has been marked for killing before containerAlreadyLaunched is set (i.e. the pid file doesn't exist for a period of 30 seconds), then it returns false and never sets containerAlreadyLaunched to true.  I think I need to update the test case.  It currently verifies that the wrong method is called.
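
To make the ordering problem concrete, here is a minimal sketch of the race, assuming a simplified model with two boolean flags. The class and method names (LaunchFlagSketch, markForKill, tryMarkLaunched, needsCleanup) are hypothetical and are not the NodeManager's actual API; only the name containerAlreadyLaunched and its default of false come from the comment above.

```java
/** Toy model of the launch/kill ordering; not the actual NodeManager code. */
class LaunchFlagSketch {
    // From the comment: this flag defaults to false.
    private boolean containerAlreadyLaunched = false;
    // Hypothetical flag standing in for "container has been marked for killing".
    private boolean markedForKill = false;

    /** Hypothetical: record that a kill was requested. */
    synchronized void markForKill() {
        markedForKill = true;
    }

    /**
     * Hypothetical launch step. If the kill was requested first, it returns
     * false and leaves containerAlreadyLaunched at its default.
     */
    synchronized boolean tryMarkLaunched() {
        if (markedForKill) {
            return false;
        }
        containerAlreadyLaunched = true;
        return true;
    }

    /** Hypothetical cleanup gate: cleanup is skipped while the flag is false. */
    synchronized boolean needsCleanup() {
        return containerAlreadyLaunched;
    }

    public static void main(String[] args) {
        // Kill arrives before the flag is set: cleanup is skipped.
        LaunchFlagSketch killedFirst = new LaunchFlagSketch();
        killedFirst.markForKill();
        killedFirst.tryMarkLaunched();
        System.out.println("kill first, needsCleanup = " + killedFirst.needsCleanup());

        // Flag is set before the kill: cleanup runs.
        LaunchFlagSketch launchedFirst = new LaunchFlagSketch();
        launchedFirst.tryMarkLaunched();
        launchedFirst.markForKill();
        System.out.println("launch first, needsCleanup = " + launchedFirst.needsCleanup());
    }
}
```

In this model, a container whose kill request wins the race is treated as never launched, even if a Docker container was actually created for it. That matches the "not launched. No cleanup needed to be done" message in the log quoted below.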


was (Author: eyang):
[~Jim_Brennan] containerAlreadyLaunched defaults to false.  If the container has been marked for killing before containerAlreadyLaunched is set (i.e. the pid file doesn't exist for a period of 30 seconds), then it returns false and never sets containerAlreadyLaunched to true.

> Docker container exited with failure does not get clean up correctly
> --------------------------------------------------------------------
>
>                 Key: YARN-9486
>                 URL: https://issues.apache.org/jira/browse/YARN-9486
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.2.0
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>         Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When a Docker container encounters an error and exits prematurely (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container. Instead, we get messages that look like this:
> {code}
> java.io.IOException: Could not find nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_000007//container_1555111445937_0008_01_000007.pid in any of the directories
> 2019-04-15 20:42:16,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_000007 transitioned from RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1555111445937_0008_01_000007 not launched. No cleanup needed to be done
> 2019-04-15 20:42:16,455 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1555111445937_0008	CONTAINERID=container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_000007 transitioned from EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1555111445937_0008_01_000007 from application application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555111445937_0008_01_000007 for log-aggregation
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting localization status for container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Returning ContainerStatus: [ContainerId: container_1555111445937_0008_01_000007, ExecutionType: GUARANTEED, State: COMPLETE, Capability: <memory:1024, vCores:1>, Diagnostics: ..., ExitStatus: -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1555111445937_0008_01_000007]
> 2019-04-15 20:43:50,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1555111445937_0008_01_000007
> {code}
> No docker rm command is performed.
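
One way to check whether a node is affected is to scan the NodeManager log for the ContainerCleanup message quoted above. The following is a minimal sketch, assuming the message wording matches the excerpt exactly (it may differ between Hadoop versions); the class name SkippedCleanupScan and the method findSkipped are hypothetical helpers, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: lists container IDs whose cleanup was skipped. */
class SkippedCleanupScan {
    // Assumes the wording from the log excerpt in the issue description.
    private static final Pattern SKIPPED = Pattern.compile(
        "ContainerCleanup: Container (container_\\S+) not launched\\.");

    /** Returns the container ID from every line that reports a skipped cleanup. */
    static List<String> findSkipped(List<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = SKIPPED.matcher(line);
            if (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two lines copied from the excerpt in the issue description.
        List<String> sample = Arrays.asList(
            "2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager."
                + "containermanager.launcher.ContainerCleanup: Cleaning up container "
                + "container_1555111445937_0008_01_000007",
            "2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager."
                + "containermanager.launcher.ContainerCleanup: Container "
                + "container_1555111445937_0008_01_000007 not launched. "
                + "No cleanup needed to be done");
        for (String id : findSkipped(sample)) {
            System.out.println("cleanup skipped for " + id);
        }
    }
}
```

Each ID it reports can then be compared against the containers Docker still knows about on that host to confirm whether a container was left behind.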


