You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Karam Singh (Created) (JIRA)" <ji...@apache.org> on 2011/11/23 14:47:40 UTC
[jira] [Created] (MAPREDUCE-3463) AM fails to restart when first AM
is killed
AM fails to restart when first AM is killed
-------------------------------------------
Key: MAPREDUCE-3463
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: applicationmaster, mrv2
Affects Versions: 0.23.1
Reporter: Karam Singh
Priority: Blocker
Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
First Ran Randowriter/Sort/Sort-validate successfully
Then again sort, when job was 50% complete
Login node running AppMaster, and killed AppMaster with kill -9
On Client side failed with following:
{code}
11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
RemoteTrace:
at Local Trace:
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
at org.apache.hadoop.examples.Sort.run(Sort.java:181)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.examples.Sort.main(Sort.java:192)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
{code}
On lookig RM logs found second AM was also lauched, it was saying -:
{code}
011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
{code}
Now looking at AM logs and found Second AM was shutdown gracefully due to :-
{code}
2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:619)
2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
{code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160749#comment-13160749 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Hdfs-0.23-Commit #237 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/237/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Mahadev konar (Assigned) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahadev konar reassigned MAPREDUCE-3463:
----------------------------------------
Assignee: Siddharth Seth
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159918#comment-13159918 ]
Mahadev konar commented on MAPREDUCE-3463:
------------------------------------------
Sid,
Can we add a test case to make sure the Recovery Service doesnt crash and burn?
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160742#comment-13160742 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Hdfs-trunk-Commit #1423 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1423/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160743#comment-13160743 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Common-0.23-Commit #239 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/239/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156272#comment-13156272 ]
Siddharth Seth commented on MAPREDUCE-3463:
-------------------------------------------
Related to MAPREDUCE-3450.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Priority: Blocker
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160897#comment-13160897 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Mapreduce-0.23-Build #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/112/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158080#comment-13158080 ]
Mahadev konar commented on MAPREDUCE-3463:
------------------------------------------
Sid,
This patch can go in as it is? Or need MR-3450 first? Can you explain the relation?
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160600#comment-13160600 ]
Siddharth Seth commented on MAPREDUCE-3463:
-------------------------------------------
bq. Would help to change the InlineDispatcher as well as it is used in various places. There is a default ctor for DrainDispatcher that needs to change as well.
Constructors in Inline/DranDispatcher to support the exit flag ? Had just left them to behave the way they currently do - exit in case of an exception. Was just fixing testRecovery and the RecoveryDispatcher in this patch.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160871#comment-13160871 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Hdfs-HAbranch-build #2 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/2/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160585#comment-13160585 ]
Hadoop QA commented on MAPREDUCE-3463:
--------------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12505698/MR3463_v2.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1375//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1375//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1375//console
This message is automatically generated.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160760#comment-13160760 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Mapreduce-trunk-Commit #1373 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1373/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hitesh Shah (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160591#comment-13160591 ]
Hitesh Shah commented on MAPREDUCE-3463:
----------------------------------------
Would help to change the InlineDispatcher as well as it is used in various places. There is a default ctor for DrainDispatcher that needs to change as well.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160730#comment-13160730 ]
Mahadev konar commented on MAPREDUCE-3463:
------------------------------------------
+1 looks good to me.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158966#comment-13158966 ]
Siddharth Seth commented on MAPREDUCE-3463:
-------------------------------------------
MR 3450 adds the port information to TaskAttemptInfo which is used by this patch.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160881#comment-13160881 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Hdfs-trunk #881 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/881/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160911#comment-13160911 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Mapreduce-trunk #914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/914/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------
Attachment: MR3463_v1.txt
Trivial patch to use the port while constructing the NodeId.
Requires MR-3450. Unit tests for the port are part of 3450 - none included in this patch.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160751#comment-13160751 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Common-trunk-Commit #1350 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1350/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahadev konar updated MAPREDUCE-3463:
-------------------------------------
Status: Patch Available (was: Open)
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------
Status: Patch Available (was: Open)
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Karam Singh (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karam Singh updated MAPREDUCE-3463:
-----------------------------------
Summary: Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job (was: AM fails to restart when first AM is killed)
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Priority: Blocker
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158095#comment-13158095 ]
Hadoop QA commented on MAPREDUCE-3463:
--------------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12504978/MR3463_v1.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The patch appears to cause tar ant target to fail.
-1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed the unit tests build
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1346//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1346//console
This message is automatically generated.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hitesh Shah (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160604#comment-13160604 ]
Hitesh Shah commented on MAPREDUCE-3463:
----------------------------------------
Ignore my previous comment. The above of changing the defaults for the inline and drain dispatchers to not exit could be addressed in MAPREDUCE-3489.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160761#comment-13160761 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Mapreduce-0.23-Commit #249 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/249/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------
Status: Open (was: Patch Available)
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------
Attachment: MR3463_v2.txt
TestRecovery was faling silently without the previous patch (System.exit in the dispatcher).
This patch changes the test to cause a unit test failure if the dispatcher catches an exception.
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahadev konar updated MAPREDUCE-3463:
-------------------------------------
Resolution: Fixed
Fix Version/s: 0.23.1
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)
I just committed this. Thanks Sid!
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover
properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160865#comment-13160865 ]
Hudson commented on MAPREDUCE-3463:
-----------------------------------
Integrated in Hadoop-Hdfs-0.23-Build #94 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/94/])
MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3463
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1
> Reporter: Karam Singh
> Assignee: Siddharth Seth
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job: map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job: map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace:
> at Local Trace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
> at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
> at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira