You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Karam Singh (Created) (JIRA)" <ji...@apache.org> on 2011/11/23 14:47:40 UTC

[jira] [Created] (MAPREDUCE-3463) AM fails to restart when first AM is killed

AM fails to restart when first AM is killed
-------------------------------------------

                 Key: MAPREDUCE-3463
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2
    Affects Versions: 0.23.1
            Reporter: Karam Singh
            Priority: Blocker


Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
First Ran Randowriter/Sort/Sort-validate successfully
Then again sort, when job was 50% complete
Login node running AppMaster, and killed AppMaster with kill -9
On Client side failed with following:
{code}
11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
RemoteTrace: 
 at Local Trace: 
	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
{code}

On lookig RM logs found second AM was also lauched, it was saying -:
{code}
011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
{code}

Now looking at AM logs and found Second AM was shutdown gracefully due to :-
{code}
2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
        at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
        at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
        at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
        at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160749#comment-13160749 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #237 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/237/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Mahadev konar (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar reassigned MAPREDUCE-3463:
----------------------------------------

    Assignee: Siddharth Seth
    
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159918#comment-13159918 ] 

Mahadev konar commented on MAPREDUCE-3463:
------------------------------------------

Sid,
 Can we add a test case to make sure the Recovery Service doesnt crash and burn?
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160742#comment-13160742 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1423 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1423/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160743#comment-13160743 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #239 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/239/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156272#comment-13156272 ] 

Siddharth Seth commented on MAPREDUCE-3463:
-------------------------------------------

Related to MAPREDUCE-3450.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Priority: Blocker
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160897#comment-13160897 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/112/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158080#comment-13158080 ] 

Mahadev konar commented on MAPREDUCE-3463:
------------------------------------------

Sid,
 This patch can go in as it is? Or need MR-3450 first? Can you explain the relation?

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160600#comment-13160600 ] 

Siddharth Seth commented on MAPREDUCE-3463:
-------------------------------------------

bq. Would help to change the InlineDispatcher as well as it is used in various places. There is a default ctor for DrainDispatcher that needs to change as well.

Constructors in Inline/DranDispatcher to support the exit flag ? Had just left them to behave the way they currently do - exit in case of an exception. Was just fixing testRecovery and the RecoveryDispatcher in this patch.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160871#comment-13160871 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Hdfs-HAbranch-build #2 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/2/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160585#comment-13160585 ] 

Hadoop QA commented on MAPREDUCE-3463:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12505698/MR3463_v2.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1375//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1375//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1375//console

This message is automatically generated.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160760#comment-13160760 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1373 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1373/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hitesh Shah (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160591#comment-13160591 ] 

Hitesh Shah commented on MAPREDUCE-3463:
----------------------------------------

Would help to change the InlineDispatcher as well as it is used in various places. There is a default ctor for DrainDispatcher that needs to change as well. 
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160730#comment-13160730 ] 

Mahadev konar commented on MAPREDUCE-3463:
------------------------------------------

+1 looks good to me.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158966#comment-13158966 ] 

Siddharth Seth commented on MAPREDUCE-3463:
-------------------------------------------

MR 3450 adds the port information to TaskAttemptInfo which is used by this patch.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160881#comment-13160881 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #881 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/881/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160911#comment-13160911 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/914/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------

    Attachment: MR3463_v1.txt

Trivial patch to use the port while constructing the NodeId.

Requires MR-3450. Unit tests for the port are part of 3450 - none included in this patch.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160751#comment-13160751 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1350 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1350/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208994
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3463:
-------------------------------------

    Status: Patch Available  (was: Open)
    
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------

    Status: Patch Available  (was: Open)
    
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Karam Singh (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karam Singh updated MAPREDUCE-3463:
-----------------------------------

    Summary: Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job  (was: AM fails to restart when first AM is killed)
    
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Priority: Blocker
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158095#comment-13158095 ] 

Hadoop QA commented on MAPREDUCE-3463:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12504978/MR3463_v1.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause tar ant target to fail.

    -1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed the unit tests build

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1346//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1346//console

This message is automatically generated.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hitesh Shah (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160604#comment-13160604 ] 

Hitesh Shah commented on MAPREDUCE-3463:
----------------------------------------

Ignore my previous comment. The above of changing the defaults for the inline and drain dispatchers to not exit could be addressed in MAPREDUCE-3489. 
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160761#comment-13160761 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #249 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/249/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------

    Status: Open  (was: Patch Available)
    
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3463:
--------------------------------------

    Attachment: MR3463_v2.txt

TestRecovery was faling silently without the previous patch (System.exit in the dispatcher).

This patch changes the test to cause a unit test failure if the dispatcher catches an exception.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3463:
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.23.1
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

I just committed this. Thanks Sid!
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160865#comment-13160865 ] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #94 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/94/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira