You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Cindy Li (JIRA)" <ji...@apache.org> on 2014/11/27 00:04:13 UTC

[jira] [Updated] (YARN-2909) RM failover lead to failure to recognition of application attempt

     [ https://issues.apache.org/jira/browse/YARN-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cindy Li updated YARN-2909:
---------------------------
    Description: 
Seeing this from client side (Hive job): 
2014-11-25 10:00:50,179 Stage-3 map = 100%,  reduce = 99%, Cumulative CPU 136560.72 sec
2014-11-25 10:00:54,776 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 42627.6 sec
2014-11-25 10:01:20,097 Stage-3 map = 0%,  reduce = 0%
2014-11-25 10:02:20,348 Stage-3 map = 0%,  reduce = 0%
2014-11-25 10:02:30,702 Stage-3 map = 1%,  reduce = 0%, Cumulative CPU 1511.98 sec




Seeing this on resource manager (rm2):

[14:16]:[hadoop@phxaishdc20en0008-be:logs]# grep container_1416845430616_0009_01_000001 *
hadoop-hadoop-resourcemanager.log.2014-11-25-10:2014-11-25 10:01:09,757 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: received container statuses on node manager register :[ContainerStatus: [ContainerId: container_1416845430616_0009_01_000001, State: COMPLETE, Diagnostics: , ExitStatus: 0, ], ContainerStatus: [ContainerId: container_1416845430616_0014_01_000241, State: COMPLETE, Diagnostics: Container Killed by ResourceManager

Seeing this in container log (Logs for container_1416845430616_0009_01_000001)
2014-11-25 10:59:17,839 INFO [IPC Server handler 2 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 10:59:47,905 INFO [IPC Server handler 26 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 11:00:01,672 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2014-11-25 11:00:17,977 INFO [IPC Server handler 5 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 11:00:36,882 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2014-11-25 11:00:48,044 INFO [IPC Server handler 2 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 11:00:52,901 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2014-11-25 11:00:52,912 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1416845430616_0009
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1416845430616_0009
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:580)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:220)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:268)
	at java.lang.Thread.run(Thread.java:745)
2014-11-25 11:00:52,915 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1416845430616_0009Job Transitioned from RUNNING to REBOOT
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services
2014-11-25 11:00:52,917 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
2014-11-25 11:00:52,946 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
2014-11-25 11:00:52,947 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1416845430616_0009_r_000000_0
2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1416845430616_0009_r_000000_0 TaskAttempt Transitioned from RUNNING to KILLED
2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1416845430616_0009_r_000000_1 TaskAttempt Transitioned from NEW to UNASSIGNED
2014-11-25 11:00:52,958 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:1 CompletedMaps:414 CompletedReds:0 ContAlloc:827 ContRel:411 HostLocal:298 RackLocal:108
2014-11-25 11:00:52,961 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Skipping cleaning up the staging dir. assuming AM will be retried.
2014-11-25 11:00:52,962 INFO [Thread-3928] org.apache.hadoop.ipc.Server: Stopping server on 36552
2014-11-25 11:00:52,968 INFO [IPC Server listener on 36552] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 36552
2014-11-25 11:00:52,968 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-11-25 11:00:52,968 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted



  was:
Seeing this from client side (Hive job): 
2014-11-25 10:00:50,179 Stage-3 map = 100%,  reduce = 99%, Cumulative CPU 136560.72 sec
2014-11-25 10:00:54,776 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 42627.6 sec
2014-11-25 10:01:20,097 Stage-3 map = 0%,  reduce = 0%
2014-11-25 10:02:20,348 Stage-3 map = 0%,  reduce = 0%
2014-11-25 10:02:30,702 Stage-3 map = 1%,  reduce = 0%, Cumulative CPU 1511.98 sec




Seeing this resource manager (rm2):

[14:16]:[hadoop@phxaishdc20en0008-be:logs]# grep container_1416845430616_0009_01_000001 *
hadoop-hadoop-resourcemanager.log.2014-11-25-10:2014-11-25 10:01:09,757 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: received container statuses on node manager register :[ContainerStatus: [ContainerId: container_1416845430616_0009_01_000001, State: COMPLETE, Diagnostics: , ExitStatus: 0, ], ContainerStatus: [ContainerId: container_1416845430616_0014_01_000241, State: COMPLETE, Diagnostics: Container Killed by ResourceManager

Seeing this in container log (Logs for container_1416845430616_0009_01_000001)
2014-11-25 10:59:17,839 INFO [IPC Server handler 2 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 10:59:47,905 INFO [IPC Server handler 26 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 11:00:01,672 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2014-11-25 11:00:17,977 INFO [IPC Server handler 5 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 11:00:36,882 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2014-11-25 11:00:48,044 INFO [IPC Server handler 2 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
2014-11-25 11:00:52,901 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2014-11-25 11:00:52,912 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1416845430616_0009
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1416845430616_0009
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:580)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:220)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:268)
	at java.lang.Thread.run(Thread.java:745)
2014-11-25 11:00:52,915 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1416845430616_0009Job Transitioned from RUNNING to REBOOT
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is false
2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services
2014-11-25 11:00:52,917 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
2014-11-25 11:00:52,946 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
2014-11-25 11:00:52,947 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1416845430616_0009_r_000000_0
2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1416845430616_0009_r_000000_0 TaskAttempt Transitioned from RUNNING to KILLED
2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1416845430616_0009_r_000000_1 TaskAttempt Transitioned from NEW to UNASSIGNED
2014-11-25 11:00:52,958 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:1 CompletedMaps:414 CompletedReds:0 ContAlloc:827 ContRel:411 HostLocal:298 RackLocal:108
2014-11-25 11:00:52,961 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Skipping cleaning up the staging dir. assuming AM will be retried.
2014-11-25 11:00:52,962 INFO [Thread-3928] org.apache.hadoop.ipc.Server: Stopping server on 36552
2014-11-25 11:00:52,968 INFO [IPC Server listener on 36552] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 36552
2014-11-25 11:00:52,968 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-11-25 11:00:52,968 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted




> RM failover lead to failure to recognition of application attempt
> -----------------------------------------------------------------
>
>                 Key: YARN-2909
>                 URL: https://issues.apache.org/jira/browse/YARN-2909
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Cindy Li
>
> Seeing this from client side (Hive job): 
> 2014-11-25 10:00:50,179 Stage-3 map = 100%,  reduce = 99%, Cumulative CPU 136560.72 sec
> 2014-11-25 10:00:54,776 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 42627.6 sec
> 2014-11-25 10:01:20,097 Stage-3 map = 0%,  reduce = 0%
> 2014-11-25 10:02:20,348 Stage-3 map = 0%,  reduce = 0%
> 2014-11-25 10:02:30,702 Stage-3 map = 1%,  reduce = 0%, Cumulative CPU 1511.98 sec
> Seeing this on resource manager (rm2):
> [14:16]:[hadoop@phxaishdc20en0008-be:logs]# grep container_1416845430616_0009_01_000001 *
> hadoop-hadoop-resourcemanager.log.2014-11-25-10:2014-11-25 10:01:09,757 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: received container statuses on node manager register :[ContainerStatus: [ContainerId: container_1416845430616_0009_01_000001, State: COMPLETE, Diagnostics: , ExitStatus: 0, ], ContainerStatus: [ContainerId: container_1416845430616_0014_01_000241, State: COMPLETE, Diagnostics: Container Killed by ResourceManager
> Seeing this in container log (Logs for container_1416845430616_0009_01_000001)
> 2014-11-25 10:59:17,839 INFO [IPC Server handler 2 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
> 2014-11-25 10:59:47,905 INFO [IPC Server handler 26 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
> 2014-11-25 11:00:01,672 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-11-25 11:00:17,977 INFO [IPC Server handler 5 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
> 2014-11-25 11:00:36,882 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
> 2014-11-25 11:00:48,044 INFO [IPC Server handler 2 on 36552] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1416845430616_0009_r_000000_0 is : 0.9893525
> 2014-11-25 11:00:52,901 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-11-25 11:00:52,912 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1416845430616_0009
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1416845430616_0009
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:580)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:220)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:268)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-11-25 11:00:52,915 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1416845430616_0009Job Transitioned from RUNNING to REBOOT
> 2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: false
> 2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: false
> 2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: false
> 2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is false
> 2014-11-25 11:00:52,916 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services
> 2014-11-25 11:00:52,917 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
> 2014-11-25 11:00:52,946 INFO [Thread-3928] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
> 2014-11-25 11:00:52,947 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1416845430616_0009_r_000000_0
> 2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1416845430616_0009_r_000000_0 TaskAttempt Transitioned from RUNNING to KILLED
> 2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1416845430616_0009_r_000000_1 TaskAttempt Transitioned from NEW to UNASSIGNED
> 2014-11-25 11:00:52,958 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:1 CompletedMaps:414 CompletedReds:0 ContAlloc:827 ContRel:411 HostLocal:298 RackLocal:108
> 2014-11-25 11:00:52,961 INFO [Thread-3928] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Skipping cleaning up the staging dir. assuming AM will be retried.
> 2014-11-25 11:00:52,962 INFO [Thread-3928] org.apache.hadoop.ipc.Server: Stopping server on 36552
> 2014-11-25 11:00:52,968 INFO [IPC Server listener on 36552] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 36552
> 2014-11-25 11:00:52,968 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2014-11-25 11:00:52,968 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)