Posted to yarn-dev@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2014/07/31 15:43:39 UTC

[jira] [Resolved] (YARN-2283) RM failed to release the AM container

     [ https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved YARN-2283.
------------------------------

    Resolution: Duplicate

Yes, it is very likely a duplicate of MAPREDUCE-5888, especially since it no longer reproduces on later releases.  Resolving as a duplicate.

The RM is not failing to release the container; rather, the RM is intentionally giving the AM some time to clean things up after unregistering (i.e. the FINISHING state).  Unfortunately, before MAPREDUCE-5888 was fixed, the AM could hang during a failed job because a lingering non-daemon thread prevented the JVM from shutting down.  The RM eventually decides that the AM has taken too long to clean up and kills it.
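For anyone curious why a single stuck thread is enough to cause this: the JVM only exits on its own once every non-daemon thread has terminated. Below is a minimal, self-contained Java sketch of that failure mode (this is not Hadoop code; the class and thread names are invented for illustration):

{code}
// Minimal sketch (not Hadoop code) of the MAPREDUCE-5888 failure mode:
// a non-daemon thread that is never shut down keeps the JVM alive after
// main() returns, so the AM process lingers until the RM's timeout kills it.
public class NonDaemonHang {
    public static void main(String[] args) {
        Thread leaked = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(60_000L); // stands in for a service thread left running
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // a proper shutdown would interrupt and join this thread
                }
            }
        });
        // leaked.setDaemon(true); // with this line the JVM would exit normally
        leaked.start();
        System.out.println("main() done; JVM stays alive because 'leaked' is non-daemon");
    }
}
{code}

Running this prints the message and then hangs; uncommenting the setDaemon(true) line lets the JVM exit as soon as main() returns, which is essentially the class of fix MAPREDUCE-5888 applied on the AM side.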

> RM failed to release the AM container
> -------------------------------------
>
>                 Key: YARN-2283
>                 URL: https://issues.apache.org/jira/browse/YARN-2283
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>         Environment: NM1: AM running
> NM2: Map task running
> mapreduce.map.maxattempts=1
>            Reporter: Nishan Shetty
>            Priority: Critical
>
> During a container stability test I faced this problem:
> While the job was running, a map task got killed.
> Observe that even though the application is FAILED, the MRAppMaster process keeps running until the timeout because the RM did not release the AM container.
> {code}
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1405318134611_0002_01_000005 Container Transitioned from RUNNING to COMPLETED
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1405318134611_0002_01_000005 in state: COMPLETED event:FINISHED
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=testos	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1405318134611_0002	CONTAINERID=container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of container container_1405318134611_0002_01_000005 is written
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: Stored the finish data of container container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released container container_1405318134611_0002_01_000005 of capacity <memory:1024, vCores:1> on host HOST-10-18-40-153:45026, which currently has 1 containers, <memory:2048, vCores:1> used and <memory:6144, vCores:7> available, release resources=true
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2048, vCores:1> numContainers=1 user=testos user-resources=<memory:2048, vCores:1>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1405318134611_0002_01_000005, NodeId: HOST-10-18-40-153:45026, NodeHttpAddress: HOST-10-18-40-153:45025, Resource: <memory:1024, vCores:1>, Priority: 5, Token: Token { kind: ContainerToken, service: 10.18.40.153:45026 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:1>, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1 cluster=<memory:8192, vCores:8>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048, vCores:1> cluster=<memory:8192, vCores:8>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:1>, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1405318134611_0002_000001 released container container_1405318134611_0002_01_000005 on node: host: HOST-10-18-40-153:45026 #containers=1 available=6144 used=2048 with event: FINISHED
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1405318134611_0002_000001 with final state: FINISHING
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1405318134611_0002_000001 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1405318134611_0002 with final state: FINISHING
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: NodeDataChanged with state:SyncConnected for path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002/appattempt_1405318134611_0002_000001 for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1405318134611_0002 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1405318134611_0002
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1405318134611_0002_000001 State change from FINAL_SAVING to FINISHING
> 2014-07-14 14:43:35,012 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: NodeDataChanged with state:SyncConnected for path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002 for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2014-07-14 14:43:35,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1405318134611_0002 State change from FINAL_SAVING to FINISHING
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)