
Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/05/09 00:37:48 UTC

[jira] [Created] (MAPREDUCE-4235) Killing app can lead to inconsistent app status between RM and HS

Jason Lowe created MAPREDUCE-4235:
-------------------------------------

             Summary: Killing app can lead to inconsistent app status between RM and HS
                 Key: MAPREDUCE-4235
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4235
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.3
            Reporter: Jason Lowe


If a client tries to kill an application that is about to complete, the ResourceManager's web UI and the history server can report inconsistent states for that application.  When the problem occurs, the ResourceManager shows the Status/FinalStatus as KILLED/KILLED and the history link is broken, because it still references the ApplicationMaster, which no longer exists.  The history server entry shows the application state as SUCCEEDED.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4235) Killing app can lead to inconsistent app status between RM and HS

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270908#comment-13270908 ] 

Jason Lowe commented on MAPREDUCE-4235:
---------------------------------------

The ApplicationMaster log will have this exception when it shuts down:

{noformat}
2012-05-08 16:19:34,666 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering 
RemoteTrace: 
 at LocalTrace: 
	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: RemoteTrace: 
 at LocalTrace: 
	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1336511902223_0001_000001
	at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
	at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
	at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
	at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:124)
	at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{noformat}

I believe the following sequence of events leads to the problem:

# The AM sees all tasks complete, changes its internal job state to SUCCEEDED, and triggers the job finished event (which currently waits 5 seconds, enlarging the race window).
# The client kill command connects to the AM, sees that the job state != RUNNING, and then tells the RM to kill the application.
# The RM fields the kill request, transitions the app state from RUNNING to KILLED/KILLED, and unregisters the app.  It leaves the tracking URL unchanged (it probably should null it out, as it does for AMs that exit unexpectedly).
# The AM starts shutdown and tries to unregister with the RM, and the RM claims it doesn't know about the app (because it already unregistered it internally).
# The HS reports the app status as SUCCEEDED because the jhist file shows the job completed successfully.
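
To make the ordering concrete, here is a minimal toy replay of the sequence above. It is only a sketch: ToyRM, KillRaceSketch, replay() and the placeholder strings "am-tracking-url" and "history-tracking-url" are hypothetical names invented for illustration, and none of this is Hadoop code or the RM's actual bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the suspected race. All names here are hypothetical;
// this is not Hadoop code.
final class KillRaceSketch {
    enum State { RUNNING, SUCCEEDED, KILLED }

    // Minimal stand-in for the RM's per-application bookkeeping.
    static final class ToyRM {
        final Map<String, State> finalStatus = new HashMap<>();
        final Map<String, String> trackingUrl = new HashMap<>();
        final Map<String, Boolean> registered = new HashMap<>();

        void register(String app, String amUrl) {
            finalStatus.put(app, State.RUNNING);
            trackingUrl.put(app, amUrl);
            registered.put(app, true);
        }

        // Step 3: mark KILLED and drop the registration; tracking URL untouched.
        void kill(String app) {
            finalStatus.put(app, State.KILLED);
            registered.put(app, false);
        }

        // Step 4: a late unregister is rejected because the app is already gone.
        boolean unregister(String app, State amStatus, String historyUrl) {
            if (!registered.getOrDefault(app, false)) {
                return false; // stands in for "Application doesn't exist in cache"
            }
            finalStatus.put(app, amStatus);
            trackingUrl.put(app, historyUrl);
            registered.put(app, false);
            return true;
        }
    }

    // Replays steps 1-5 in order and returns what each side ends up reporting:
    // { RM final status, HS state, RM tracking URL, unregister accepted }.
    static String[] replay() {
        ToyRM rm = new ToyRM();
        rm.register("app_0001", "am-tracking-url");
        State amJobState = State.SUCCEEDED; // step 1: AM sees all tasks complete
        State historyState = amJobState;    // step 5: jhist file records success
        rm.kill("app_0001");                // steps 2-3: client asks RM to kill
        boolean accepted =
            rm.unregister("app_0001", amJobState, "history-tracking-url"); // step 4
        return new String[] {
            rm.finalStatus.get("app_0001").name(),
            historyState.name(),
            rm.trackingUrl.get("app_0001"),
            Boolean.toString(accepted)
        };
    }

    public static void main(String[] args) {
        String[] r = replay();
        System.out.println("RM final status:     " + r[0]);
        System.out.println("HS state:            " + r[1]);
        System.out.println("RM tracking URL:     " + r[2]);
        System.out.println("unregister accepted: " + r[3]);
    }
}
```

In this toy model the RM ends up with KILLED and the stale AM URL while the history side records SUCCEEDED, which matches the symptom in the description.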

If the RM fields an unregister request for an application that was killed, we may want to consider updating the application's status and tracking URL based on the unregister request, since it is likely to be more accurate (e.g. SUCCEEDED instead of KILLED, with the tracking URL pointing to the history server).
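
A rough sketch of what that handling could look like, again as a toy model rather than a patch. LateUnregisterSketch, AppRecord and the placeholder URL strings are hypothetical, and the real RM state machine is considerably more involved than this.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the suggested handling. All names here are hypothetical;
// this is not Hadoop code.
final class LateUnregisterSketch {
    enum Status { RUNNING, SUCCEEDED, KILLED }

    static final class AppRecord {
        Status finalStatus = Status.RUNNING;
        String trackingUrl;
        boolean registered = true;
        AppRecord(String trackingUrl) { this.trackingUrl = trackingUrl; }
    }

    private final Map<String, AppRecord> apps = new HashMap<>();

    void register(String appId, String amUrl) {
        apps.put(appId, new AppRecord(amUrl));
    }

    // Kill marks the app KILLED and drops the registration, but keeps the
    // record so that a late unregister can still find it.
    void kill(String appId) {
        AppRecord app = apps.get(appId);
        if (app == null) {
            return;
        }
        app.finalStatus = Status.KILLED;
        app.registered = false;
    }

    // Accepts an unregister from a registered app, or a late one from an app
    // that was killed; in both cases the AM's report replaces the RM's view.
    boolean unregister(String appId, Status reported, String historyUrl) {
        AppRecord app = apps.get(appId);
        if (app == null) {
            return false; // unknown application
        }
        if (!app.registered && app.finalStatus != Status.KILLED) {
            return false; // already unregistered normally
        }
        app.finalStatus = reported;
        app.trackingUrl = historyUrl;
        app.registered = false;
        return true;
    }

    AppRecord get(String appId) {
        return apps.get(appId);
    }
}
```

With this variant, a kill followed by a late unregister reporting SUCCEEDED leaves the record as SUCCEEDED with the history URL, so both sides would agree. Whether the AM's report should always win over an explicit kill is a policy question this sketch does not settle.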
                
