Posted to yarn-issues@hadoop.apache.org by "Jon Bringhurst (JIRA)" <ji...@apache.org> on 2014/06/27 22:30:25 UTC
[jira] [Created] (YARN-2223) NPE on ResourceManager recover

Jon Bringhurst created YARN-2223:
------------------------------------

             Summary: NPE on ResourceManager recover
                 Key: YARN-2223
                 URL: https://issues.apache.org/jira/browse/YARN-2223
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.4.1
            Reporter: Jon Bringhurst


I upgraded two clusters from tag 2.2.0 to branch-2.4.1 (latest commit is https://github.com/apache/hadoop-common/commit/c96c8e45a60651b677a1de338b7856a444dc0461).

Both clusters have the same config (other than hostnames). Both are running on JDK8u5 (I'm not sure if this is a factor here).

One cluster started up without any errors. The other started up with the following error on the RM:

{noformat}
18:33:45,463  WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
18:33:45,465  INFO RMAppImpl:651 - Recovering app: application_1398450350082_0001 with 8 attempts and final state = KILLED
18:33:45,468  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000001 with final state: KILLED
18:33:45,478  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000002 with final state: FAILED
18:33:45,478  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000003 with final state: FAILED
18:33:45,479  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000004 with final state: FAILED
18:33:45,479  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000005 with final state: FAILED
18:33:45,480  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000006 with final state: FAILED
18:33:45,480  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000007 with final state: FAILED
18:33:45,481  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000008 with final state: FAILED
18:33:45,482  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000001 State change from NEW to KILLED
18:33:45,482  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000002 State change from NEW to FAILED
18:33:45,482  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000003 State change from NEW to FAILED
18:33:45,482  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000004 State change from NEW to FAILED
18:33:45,483  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000005 State change from NEW to FAILED
18:33:45,483  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000006 State change from NEW to FAILED
18:33:45,483  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000007 State change from NEW to FAILED
18:33:45,483  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000008 State change from NEW to FAILED
18:33:45,485  INFO RMAppImpl:639 - application_1398450350082_0001 State change from NEW to KILLED
18:33:45,485  WARN RMAppImpl:331 - The specific max attempts: 0 for application: 2 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
18:33:45,485  INFO RMAppImpl:651 - Recovering app: application_1398450350082_0002 with 8 attempts and final state = KILLED
18:33:45,486  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000001 with final state: KILLED
18:33:45,486  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000002 with final state: FAILED
18:33:45,487  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000003 with final state: FAILED
18:33:45,487  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000004 with final state: FAILED
18:33:45,488  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000005 with final state: FAILED
18:33:45,488  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000006 with final state: FAILED
18:33:45,489  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000007 with final state: FAILED
18:33:45,489  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000008 with final state: FAILED
18:33:45,490  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000001 State change from NEW to KILLED
18:33:45,490  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000002 State change from NEW to FAILED
18:33:45,490  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000003 State change from NEW to FAILED
18:33:45,490  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000004 State change from NEW to FAILED
18:33:45,491  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000005 State change from NEW to FAILED
18:33:45,491  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000006 State change from NEW to FAILED
18:33:45,491  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000007 State change from NEW to FAILED
18:33:45,491  INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000008 State change from NEW to FAILED
18:33:45,491  INFO RMAppImpl:639 - application_1398450350082_0002 State change from NEW to KILLED
18:33:45,492  WARN RMAppImpl:331 - The specific max attempts: 0 for application: 33 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
18:33:45,492  INFO RMAppImpl:651 - Recovering app: application_1401811496082_0033 with 2 attempts and final state = null
18:33:45,492  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000001 with final state: FAILED
18:33:45,492  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000002 with final state: null
18:33:45,493  INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000001 State change from NEW to FAILED
18:33:45,493  INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000002 State change from NEW to LAUNCHED
18:33:45,494  INFO RMAppImpl:639 - application_1401811496082_0033 State change from NEW to ACCEPTED
18:33:45,494  WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
18:33:45,494  INFO RMAppImpl:651 - Recovering app: application_1398453545406_0001 with 9 attempts and final state = null
18:33:45,495  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000001 with final state: FAILED
18:33:45,495  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000002 with final state: FAILED
18:33:45,496  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000003 with final state: FAILED
18:33:45,496  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000004 with final state: FAILED
18:33:45,496  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000005 with final state: FAILED
18:33:45,497  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000006 with final state: FAILED
18:33:45,497  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000007 with final state: FAILED
18:33:45,498  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000008 with final state: FAILED
18:33:45,499 ERROR ResourceManager:488 - Failed to load/recover state
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
18:33:45,500  INFO AbstractService:272 - Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
18:33:45,501  INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics system...
18:33:45,502  INFO MetricsSystemImpl:206 - ResourceManager metrics system stopped.
18:33:45,502  INFO MetricsSystemImpl:572 - ResourceManager metrics system shutdown complete.
18:33:45,502  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,503  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,503  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,503  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,505  INFO AbstractService:272 - Service ResourceManager failed in state STARTED; cause: java.lang.NullPointerException
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
18:33:45,505  INFO ResourceManager:891 - Transitioning to standby state
18:33:45,505  INFO ResourceManager:901 - Transitioned to standby state
18:33:45,505 FATAL ResourceManager:1042 - Error starting ResourceManager
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
18:33:45,509  INFO ResourceManager:640 - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at xxxxxmy_server_hostname/x.x.x.x
************************************************************/
{noformat}
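
One cross-check on the log above is to compare the attempt count each "Recovering app" line declares with the number of "Recovering attempt" lines that follow for that application. The snippet below is a minimal sketch of that check, assuming the log format shown in the excerpt. The function name, the regexes, and the condensed sample are illustrative and are not part of Hadoop.

```python
import re
from collections import Counter

# Patterns assume the ResourceManager log format shown in the excerpt above.
APP_RE = re.compile(r"Recovering app: (application_\d+_\d+) with (\d+) attempts")
ATTEMPT_RE = re.compile(r"Recovering attempt: appattempt_(\d+_\d+)_\d+ with final state")


def find_incomplete_recoveries(log_lines):
    """Return {app_id: (declared, logged)} for apps whose two counts differ."""
    declared = {}
    logged = Counter()
    for line in log_lines:
        m = APP_RE.search(line)
        if m:
            declared[m.group(1)] = int(m.group(2))
            continue
        m = ATTEMPT_RE.search(line)
        if m:
            logged["application_" + m.group(1)] += 1
    return {
        app: (count, logged[app])
        for app, count in declared.items()
        if logged[app] != count
    }


# Condensed sample built from lines in the excerpt (timestamps simplified).
sample = [
    "18:33:45,492  INFO RMAppImpl:651 - Recovering app: application_1401811496082_0033 with 2 attempts and final state = null",
    "18:33:45,492  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000001 with final state: FAILED",
    "18:33:45,492  INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000002 with final state: null",
    "18:33:45,494  INFO RMAppImpl:651 - Recovering app: application_1398453545406_0001 with 9 attempts and final state = null",
]
sample += [
    "18:33:45,495  INFO RMAppAttemptImpl:691 - Recovering attempt: "
    f"appattempt_1398453545406_0001_{i:06d} with final state: FAILED"
    for i in range(1, 9)
]

print(find_incomplete_recoveries(sample))
```

Applied to the excerpt above, the only mismatch is application_1398453545406_0001, which declares 9 attempts while only 8 "Recovering attempt" lines appear before the NullPointerException. Whether that mismatch is related to the NPE is not established here.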

When I attempted to start this cluster up again after the failure, it crashed again with:

{noformat}
19:19:05,662  INFO AMRMTokenSecretManager:107 - Rolling master-key for amrm-tokens
19:19:05,665  INFO RMContainerTokenSecretManager:103 - Rolling master-key for container-tokens
19:19:05,665  INFO NMTokenSecretManagerInRM:95 - Rolling master-key for nm-tokens
19:19:05,665  INFO RMContainerTokenSecretManager:108 - Going to activate master-key with key-id 1885856529 in 135000ms
19:19:05,665  INFO NMTokenSecretManagerInRM:100 - Going to activate master-key with key-id 1756560776 in 135000ms
19:19:35,971  INFO RMDelegationTokenSecretManager:96 - removing master key with keyID 86
19:19:35,971  INFO FileSystemRMStateStore:484 - Removing RMDelegationKey_86
19:19:35,972  INFO AbstractDelegationTokenSecretManager:223 - Updating the current master key for generating delegation tokens
19:19:35,972  INFO RMDelegationTokenSecretManager:85 - storing master key with keyID 94
19:19:35,973  INFO FileSystemRMStateStore:473 - Storing RMDelegationKey_94
19:21:20,666  INFO RMContainerTokenSecretManager:139 - Activating next master key with id: 1885856529
19:21:20,666  INFO NMTokenSecretManagerInRM:131 - Activating next master key with id: 1756560776
16:14:06,403 ERROR ResourceManager:60 - RECEIVED SIGNAL 15: SIGTERM
16:14:06,408  INFO log:67 - Stopped SelectChannelConnector@0.0.0.0:8088
16:14:06,510  INFO Server:2399 - Stopping server on 8032
16:14:06,511  INFO Server:694 - Stopping IPC Server listener on 8032
16:14:06,511  INFO Server:820 - Stopping IPC Server Responder
16:14:06,511  INFO Server:2399 - Stopping server on 8033
16:14:06,512  INFO Server:694 - Stopping IPC Server listener on 8033
16:14:06,512  INFO Server:820 - Stopping IPC Server Responder
16:14:06,512  INFO ResourceManager:890 - Transitioning to standby state
16:14:06,513  INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics system...
16:14:06,516  INFO MetricsSystemImpl:206 - ResourceManager metrics system stopped.
16:14:06,516  INFO MetricsSystemImpl:572 - ResourceManager metrics system shutdown complete.
16:14:06,516  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,517  WARN ApplicationMasterLauncher:98 - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
16:14:06,518  INFO Server:2399 - Stopping server on 8030
16:14:06,520  INFO Server:694 - Stopping IPC Server listener on 8030
16:14:06,520  INFO Server:820 - Stopping IPC Server Responder
16:14:06,520  INFO Server:2399 - Stopping server on 8031
16:14:06,521  INFO Server:694 - Stopping IPC Server listener on 8031
16:14:06,521  INFO Server:820 - Stopping IPC Server Responder
16:14:06,522 ERROR ResourceManager:586 - Returning, interrupted : java.lang.InterruptedException
16:14:06,522  INFO AbstractLivelinessMonitor:127 - NMLivelinessMonitor thread interrupted
16:14:06,522  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,524  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,524  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,525  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,525  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,525  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,526  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,526  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,526  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,527  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,527  INFO AbstractLivelinessMonitor:127 - AMLivelinessMonitor thread interrupted
16:14:06,531 ERROR AbstractDelegationTokenSecretManager:557 - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
16:14:06,531  INFO AbstractLivelinessMonitor:127 - org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
16:14:06,527  INFO AbstractLivelinessMonitor:127 - AMLivelinessMonitor thread interrupted
16:14:06,532  INFO ResourceManager:900 - Transitioned to standby state
16:14:06,532  INFO ResourceManager:640 - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at xxxx_my_server_hostname/x.x.x.x
************************************************************/
{noformat}

Subsequent startups result in an error that appears similar.

Before I try to wipe the state of this cluster, is there any debug info you'd like me to gather?

Note that the following warning appears in the logs above. I haven't gotten around to fixing it yet, and I'm not sure whether it's related to the crash.

{noformat}
18:33:45,463  WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
{noformat}
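
The warning text suggests that a per-application max-attempts value outside the allowed range is replaced by the global setting. The snippet below is a minimal sketch of that fallback as the message describes it, not the actual RMAppImpl code. The function name is hypothetical, and the global limit of 50 is read off the range printed in the warning.

```python
def effective_max_attempts(requested, global_max=50):
    """Use the per-app value only if it lies in [1, global_max]; otherwise use the global limit."""
    if 1 <= requested <= global_max:
        return requested
    return global_max


# 0 is the per-application value reported in the warning above.
print(effective_max_attempts(0))
```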



--
This message was sent by Atlassian JIRA
(v6.2#6252)