Posted to yarn-issues@hadoop.apache.org by "Rohith (JIRA)" <ji...@apache.org> on 2014/12/15 12:02:13 UTC
[jira] [Commented] (YARN-2340) NPE thrown when RM restart after
queue is STOPPED. There after RM can not recovery application's and remain
in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246555#comment-14246555 ]
Rohith commented on YARN-2340:
------------------------------
Some thoughts on fixing this issue. Either of the two options below could work:
1. Directly send a KILL event for the application if it is being submitted to a STOPPED queue during application recovery. The KILL event smoothly transitions RMApp/RMAppAttempt to the KILLED state. However, an exception is thrown while killing the master container, either because the NMs have not yet registered with the RM or because of "Connection Refused" when the NM is down.
{{CS#addApplication}}
{code}
// Submit to the queue
try {
  queue.submitApplication(applicationId, user, queueName);
} catch (AccessControlException ace) {
  LOG.info("Failed to submit application " + applicationId + " to queue "
      + queueName + " from user " + user, ace);
  if (isAppRecovering) {
    LOG.info("Killing the application " + applicationId);
    this.rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppEvent(applicationId, RMAppEventType.KILL));
  } else {
    this.rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppRejectedEvent(applicationId, ace.toString()));
  }
  return;
}
{code}
{{CS#addApplicationAttempt}}
{code}
SchedulerApplication<FiCaSchedulerApp> application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null && isAttemptRecovering) {
  LOG.info("Attempt is recovering from an application where Queue is stopped."
      + applicationAttemptId);
  return;
}
{code}
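To make the intent of the two guards above easier to follow, here is a minimal, self-contained sketch. It is a toy model and not the real CapacityScheduler: the class {{RecoveryGuardSketch}}, its enums, and the string-based event log are all hypothetical names made up for illustration. Throwing {{IllegalStateException}} for a missing application outside recovery is also my assumption, standing in for the NPE seen today.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the option-1 guards. All names here are hypothetical. */
final class RecoveryGuardSketch {
    enum QueueState { RUNNING, STOPPED }
    enum Outcome { ADDED, KILLED, REJECTED, ATTEMPT_ADDED, ATTEMPT_SKIPPED }

    private final Map<String, QueueState> queues = new HashMap<>();
    // Application id -> queue name, only for applications the scheduler accepted.
    private final Map<String, String> applications = new HashMap<>();
    // Stand-in for events that would be sent through the RM dispatcher.
    private final List<String> events = new ArrayList<>();

    void setQueueState(String queue, QueueState state) {
        queues.put(queue, state);
    }

    /** Mirrors the first guard: KILL while recovering, reject otherwise. */
    Outcome addApplication(String appId, String queue, boolean isAppRecovering) {
        if (queues.get(queue) != QueueState.RUNNING) {
            if (isAppRecovering) {
                events.add("KILL " + appId);
                return Outcome.KILLED;
            }
            events.add("APP_REJECTED " + appId);
            return Outcome.REJECTED;
        }
        applications.put(appId, queue);
        return Outcome.ADDED;
    }

    /** Mirrors the second guard: skip the attempt if its app was never added. */
    Outcome addApplicationAttempt(String appId, boolean isAttemptRecovering) {
        String queue = applications.get(appId);
        if (queue == null && isAttemptRecovering) {
            // The application was killed during recovery, so there is nothing to attach to.
            return Outcome.ATTEMPT_SKIPPED;
        }
        if (queue == null) {
            // Assumption: fail loudly instead of dereferencing null.
            throw new IllegalStateException("Unknown application " + appId);
        }
        return Outcome.ATTEMPT_ADDED;
    }

    List<String> events() {
        return events;
    }
}
```

With queue "a" STOPPED, {{addApplication("app_1", "a", true)}} records a KILL event and the following {{addApplicationAttempt("app_1", true)}} is skipped rather than failing on a null application.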
2. Introduce a new event type such as APP_RECOVERY_FAILED or APP_SCHEDULER_RECOVERY_FAILED, triggered from the scheduler if the app is submitted to a stopped queue while recovering. The transitions would be as follows:
AppAttempt : {{NEW to LAUNCHED}}
App : {{NEW to ACCEPTED}}
App : {{ACCEPTED to FINAL_SAVING}} on event APP_RECOVERY_FAILED or APP_SCHEDULER_RECOVERY_FAILED
AppAttempt : {{LAUNCHED to FINAL_SAVING}}
AppAttempt : {{FINAL_SAVING to FAILED}}
App : {{FINAL_SAVING to FAILED}}
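The App side of these transitions can be sketched as a small transition table. This is only an illustration and not the actual RMAppImpl state machine: the class name {{RecoveryFailedTransitionsSketch}} and the event names RECOVER and STATE_SAVED are placeholders I made up; only APP_RECOVERY_FAILED comes from the proposal above. The AppAttempt side (LAUNCHED to FINAL_SAVING to FAILED) would presumably follow the same shape.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table for the proposed App flow. Names are hypothetical. */
final class RecoveryFailedTransitionsSketch {
    enum AppState { NEW, ACCEPTED, FINAL_SAVING, FAILED }

    // RECOVER and STATE_SAVED are placeholder event names for this sketch.
    enum AppEvent { RECOVER, APP_RECOVERY_FAILED, STATE_SAVED }

    private static final Map<AppState, Map<AppEvent, AppState>> TABLE =
        new EnumMap<>(AppState.class);

    static {
        add(AppState.NEW, AppEvent.RECOVER, AppState.ACCEPTED);
        add(AppState.ACCEPTED, AppEvent.APP_RECOVERY_FAILED, AppState.FINAL_SAVING);
        add(AppState.FINAL_SAVING, AppEvent.STATE_SAVED, AppState.FAILED);
    }

    private static void add(AppState from, AppEvent on, AppState to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<AppEvent, AppState>(AppEvent.class))
            .put(on, to);
    }

    /** Returns the next state, or throws if the event is not valid in this state. */
    static AppState next(AppState current, AppEvent event) {
        Map<AppEvent, AppState> row = TABLE.get(current);
        AppState target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException(
                "Invalid event " + event + " at state " + current);
        }
        return target;
    }
}
```

Starting from NEW, applying RECOVER, APP_RECOVERY_FAILED and then STATE_SAVED walks the app through ACCEPTED and FINAL_SAVING to FAILED; any other event in those states is rejected.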
Please give your suggestions/thoughts.
> NPE thrown when RM restart after queue is STOPPED. There after RM can not recovery application's and remain in standby
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2340
> URL: https://issues.apache.org/jira/browse/YARN-2340
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, scheduler
> Affects Versions: 2.4.1
> Environment: Capacityscheduler with Queue a, b
> Reporter: Nishan Shetty
> Assignee: Rohith
> Priority: Critical
>
> While a job is in progress, set the queue state to STOPPED and then restart the RM.
> Observe that the standby RM fails to come up as active, throwing the NPE below:
> 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_000002 State change from NEW to SUBMITTED
> 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602)
> at java.lang.Thread.run(Thread.java:662)
> 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)