Posted to yarn-issues@hadoop.apache.org by "Rohith (JIRA)" <ji...@apache.org> on 2014/12/15 12:02:13 UTC

[jira] [Commented] (YARN-2340) NPE thrown when RM restart after queue is STOPPED. There after RM can not recovery application's and remain in standby

    [ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246555#comment-14246555 ] 

Rohith commented on YARN-2340:
------------------------------

Some thoughts on fixing this issue. Either of the two options below could work:
1. Directly trigger a KILL event for the application if it is being submitted to a STOPPED queue during application recovery. The KILL event smoothly transitions RMApp/RMAppAttempt to the KILLED state. However, an exception is thrown while killing the master container, either because the NMs have not yet registered with the RM or because of "Connection Refused" when the NM is down.
{{CS#addApplication}}
{code}
    // Submit to the queue
    try {
      queue.submitApplication(applicationId, user, queueName);
    } catch (AccessControlException ace) {
      LOG.info("Failed to submit application " + applicationId + " to queue "
          + queueName + " from user " + user, ace);
      if (isAppRecovering) {
        LOG.info("Killing the application " + applicationId);
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.KILL));
      } else {
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppRejectedEvent(applicationId, ace.toString()));
      }
      return;
    }
{code}

{{CS#addApplicationAttempt}}
{code}
    SchedulerApplication<FiCaSchedulerApp> application =
        applications.get(applicationAttemptId.getApplicationId());
    if (application == null && isAttemptRecovering) {
      // Trailing space keeps the attempt id from running into the message text
      LOG.info("Attempt is recovering from an application where Queue is stopped. "
          + applicationAttemptId);
      return;
    }
{code}
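
To illustrate why the null check above matters, here is a minimal, self-contained sketch. It assumes that an application rejected by a stopped queue is never registered in the scheduler's application map, so the later lookup for its attempt returns null. The class {{StoppedQueueRecoverySketch}} and its fields are hypothetical stand-ins, not CapacityScheduler code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for the scheduler; not the real CapacityScheduler. */
final class StoppedQueueRecoverySketch {
  // appId -> queue name; plays the role of the scheduler's application map
  private final Map<String, String> applications = new HashMap<>();
  private final boolean queueStopped;

  StoppedQueueRecoverySketch(boolean queueStopped) {
    this.queueStopped = queueStopped;
  }

  /** Assumption: a rejected submission returns early and is never registered. */
  void addApplication(String appId, String queueName) {
    if (queueStopped) {
      return;
    }
    applications.put(appId, queueName);
  }

  /** Returns true if the attempt was attached, false if it was skipped. */
  boolean addApplicationAttempt(String appId, boolean isAttemptRecovering) {
    String queueName = applications.get(appId);
    if (queueName == null && isAttemptRecovering) {
      // Same guard as proposed above: skip instead of dereferencing null
      return false;
    }
    // Without the guard, an unregistered app would fail here with an NPE
    Objects.requireNonNull(queueName, "application not registered: " + appId);
    return true;
  }

  public static void main(String[] args) {
    StoppedQueueRecoverySketch scheduler = new StoppedQueueRecoverySketch(true);
    scheduler.addApplication("app_1", "a");
    System.out.println(scheduler.addApplicationAttempt("app_1", true));
  }
}
```

In this sketch the guard only covers recovering attempts, matching the proposed condition; a non-recovering attempt for an unknown application still fails fast.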

2. Introduce a new event type such as APP_RECOVERY_FAILED or APP_SCHEDULER_RECOVERY_FAILED, and trigger it from the scheduler if the app is submitted to a stopped queue while recovering. The transitions would be as follows:
AppAttempt : {{NEW to LAUNCHED}}
App : {{NEW to ACCEPTED}}
App : {{ACCEPTED to FINAL_SAVING}} on event APP_RECOVERY_FAILED or APP_SCHEDULER_RECOVERY_FAILED
AppAttempt : {{LAUNCHED to FINAL_SAVING}}
AppAttempt : {{FINAL_SAVING to FAILED}}
App : {{FINAL_SAVING to FAILED}}
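
The App-side transitions listed above could be sketched as a small transition table. This is only a simplified model, not the real RMAppImpl state machine: the class name {{RecoveryFailedSketch}} and the event names {{RECOVER}} and {{STATE_SAVED}} are placeholders assumed for illustration; only APP_RECOVERY_FAILED comes from the proposal.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical, simplified model of the proposed App transitions. */
final class RecoveryFailedSketch {
  enum AppState { NEW, ACCEPTED, FINAL_SAVING, FAILED }

  // RECOVER and STATE_SAVED are placeholder names assumed for this sketch
  enum AppEvent { RECOVER, APP_RECOVERY_FAILED, STATE_SAVED }

  private static final Map<AppState, Map<AppEvent, AppState>> TABLE =
      new EnumMap<>(AppState.class);

  static {
    add(AppState.NEW, AppEvent.RECOVER, AppState.ACCEPTED);
    // Proposed new event, fired by the scheduler for a stopped queue
    add(AppState.ACCEPTED, AppEvent.APP_RECOVERY_FAILED, AppState.FINAL_SAVING);
    add(AppState.FINAL_SAVING, AppEvent.STATE_SAVED, AppState.FAILED);
  }

  private static void add(AppState from, AppEvent on, AppState to) {
    TABLE.computeIfAbsent(from, k -> new EnumMap<AppEvent, AppState>(AppEvent.class))
        .put(on, to);
  }

  /** Returns the next state, or throws if the transition is not defined. */
  static AppState next(AppState current, AppEvent event) {
    Map<AppEvent, AppState> row = TABLE.get(current);
    AppState to = (row == null) ? null : row.get(event);
    if (to == null) {
      throw new IllegalStateException("Invalid event " + event + " at " + current);
    }
    return to;
  }

  public static void main(String[] args) {
    AppState state = AppState.NEW;
    for (AppEvent event : AppEvent.values()) {
      state = next(state, event);
      System.out.println(event + " -> " + state);
    }
  }
}
```

The AppAttempt transitions would follow the same pattern with their own state and event enums.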

Please give your suggestions/thoughts.


> NPE thrown when RM restart after queue is STOPPED. There after RM can not recovery application's and remain in standby
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2340
>                 URL: https://issues.apache.org/jira/browse/YARN-2340
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.4.1
>         Environment: Capacityscheduler with Queue a, b
>            Reporter: Nishan Shetty
>            Assignee: Rohith
>            Priority: Critical
>
> While a job is in progress, set the queue state to STOPPED and then restart the RM.
> Observe that the standby RM fails to come up as active, throwing the NPE below:
> 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_000002 State change from NEW to SUBMITTED
> 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568)
>  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916)
>  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101)
>  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602)
>  at java.lang.Thread.run(Thread.java:662)
> 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)