You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Rohith (JIRA)" <ji...@apache.org> on 2014/11/24 18:01:13 UTC

[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()

    [ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223136#comment-14223136 ] 

Rohith commented on YARN-2025:
------------------------------

        I ran into weird scenario where I got the NPE in {{CapacityScheduler.addApplicationAttempt}} in a different manner. I could able to get some informationf from the logs but not fully since log were rolled out.

        Application final state is FAILED but ApplicationAttempt final state is null. This looks very strange that how can RMApp->FAILED but RMAppAttempt->null..?
Extracted log from RM is below. Because of this scenario, application recovery throw NPE since RMAppAttempt tries to add attempt to scheduler but application details are not added to schedulers.
{noformat}
2014-11-24 23:53:32,608 | INFO  | main-EventThread | Recovering app: application_1416805604019_0038 with 1 attempts and final state = FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2014-11-24 23:53:32,609 | INFO  | main-EventThread | Recovering attempt: appattempt_1416805604019_0038_000001 with final state: null | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:735)
{noformat}

NPE trace as follows.
{noformat}
2014-11-24 23:53:32,610 | ERROR | main-EventThread | Failed to load/recover state | org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:527)
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:963)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:931)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:698)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:803)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:95)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:825)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:808)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:681)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:335)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1148)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:523)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:927)
{noformat}

> Possible NPE in schedulers#addApplicationAttempt()
> --------------------------------------------------
>
>                 Key: YARN-2025
>                 URL: https://issues.apache.org/jira/browse/YARN-2025
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Tsuyoshi OZAWA
>            Assignee: Tsuyoshi OZAWA
>         Attachments: YARN-2025.1.patch
>
>
> In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we don't check whether {{application}} is null. This can cause NPE in following sequences: addApplication() -> doneApplication() (e.g. AppKilledTransition) -> addApplicationAttempt().
> {code}
>     SchedulerApplication application =
>         applications.get(applicationAttemptId.getApplicationId());
>     String user = application.getUser();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)