Posted to yarn-issues@hadoop.apache.org by "Varun Saxena (JIRA)" <ji...@apache.org> on 2017/07/19 22:26:00 UTC

[jira] [Comment Edited] (YARN-6847) NPE in RM while starting timeline collector on recovery after explicit failover

    [ https://issues.apache.org/jira/browse/YARN-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093898#comment-16093898 ] 

Varun Saxena edited comment on YARN-6847 at 7/19/17 10:25 PM:
--------------------------------------------------------------

Found this issue while writing tests for YARN-6130.
The NPE occurs because the RMTimelineCollectorManager object in RMContext is null.

This happens because the RMTimelineCollectorManager instance is set in the active service context inside RMContextImpl, but the object itself is created only in ResourceManager#serviceInit. So if the RM is made to transition to standby, the active service context is reset (created again) and the RMTimelineCollectorManager object is never set in the new context.

As a result, when the RM subsequently becomes active and recovery tries to start a timeline collector for a recovered app, the call fails with an NPE.
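
The sequence described above can be reproduced in a small standalone model. This is only a sketch: CollectorManager, ActiveContext and ResourceManagerModel are hypothetical stand-ins, not the actual Hadoop classes, and the reattachOnReset flag shows one possible remedy (carrying the manager over into the recreated context), which is not necessarily the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.List;

public class ContextResetSketch {

    /** Hypothetical stand-in for the timeline collector manager. */
    static class CollectorManager {
        final List<String> started = new ArrayList<>();

        void startCollector(String appId) {
            started.add(appId);
        }
    }

    /** Hypothetical stand-in for the active service context, recreated on standby. */
    static class ActiveContext {
        CollectorManager collectorManager; // stays null until explicitly set
    }

    /** Simplified model of the RM lifecycle steps named in the comment. */
    static class ResourceManagerModel {
        private final boolean reattachOnReset;
        private ActiveContext activeContext = new ActiveContext();

        ResourceManagerModel(boolean reattachOnReset) {
            this.reattachOnReset = reattachOnReset;
        }

        /** Runs once: this is the only place the manager is created and set. */
        void serviceInit() {
            activeContext.collectorManager = new CollectorManager();
        }

        /** Resets the active context; the manager is lost unless it is carried over. */
        void transitionToStandby() {
            CollectorManager previous = activeContext.collectorManager;
            activeContext = new ActiveContext();
            if (reattachOnReset) {
                activeContext.collectorManager = previous; // assumed remedy
            }
        }

        /** Recovery path: dereferences the manager held by the current context. */
        void recoverApp(String appId) {
            activeContext.collectorManager.startCollector(appId);
        }
    }

    /** Returns true if recovery after a standby transition hits an NPE. */
    static boolean recoveryThrowsNpe(boolean reattachOnReset) {
        ResourceManagerModel rm = new ResourceManagerModel(reattachOnReset);
        rm.serviceInit();
        rm.transitionToStandby();
        try {
            rm.recoverApp("app_1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("manager not re-set: NPE=" + recoveryThrowsNpe(false));
        System.out.println("manager carried over: NPE=" + recoveryThrowsNpe(true));
    }
}
```

In this model the NPE appears only when the context is recreated without the manager being set again, which matches the failover-then-recover path in the stack trace.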


was (Author: varun_saxena):
Found this issue while writing tests for YARN-6130.
NPE is because RMTimelineCollectorManager in RMContext being null.

This is because RMTimelineCollectorManager instance is set in active service context inside RMContextImpl but the object for it is created inside ResourceManager#serviceInit. This means if RM is made to transition to standby, active service context will be reset(created again) and RMTimelineCollectorManager object will never be set in it.

This means that when RM subsequently becomes active, during recovery if a timeline collector for a recovered app is to be started, that would fail due to a NPE.

> NPE in RM while starting timeline collector on recovery after explicit failover
> -------------------------------------------------------------------------------
>
>                 Key: YARN-6847
>                 URL: https://issues.apache.org/jira/browse/YARN-6847
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Varun Saxena
>
> {noformat}
> 2017-07-20 03:20:50,742 ERROR [Thread-449] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(763)) - Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.startTimelineCollector(RMAppImpl.java:535)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:467)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:336)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:576)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1419)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1178)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1218)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1214)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1214)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:319)
>         at org.apache.hadoop.yarn.client.ProtocolHATestBase.explicitFailover(ProtocolHATestBase.java:205)
>         at org.apache.hadoop.yarn.client.ProtocolHATestBase$1.run(ProtocolHATestBase.java:250)
> {noformat}


