Posted to yarn-issues@hadoop.apache.org by "Oleksandr Shevchenko (JIRA)" <ji...@apache.org> on 2018/03/05 11:33:00 UTC
[jira] [Comment Edited] (YARN-7998) RM crashes with NPE during recovering if ACL configuration was changed

    [ https://issues.apache.org/jira/browse/YARN-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385971#comment-16385971 ] 

Oleksandr Shevchenko edited comment on YARN-7998 at 3/5/18 11:32 AM:
---------------------------------------------------------------------

The RM fails with an NPE during failover if the FairScheduler configuration was changed.

The application had not finished yet, so the application final state is null, and the last app attempt has no final state either.

2018-02-28 15:50:51,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1517497680557_565955 *with 2 attempts and final state = null*
2018-02-28 15:50:54,761 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1517497680557_565955_000001 with *final state: FAILED*
2018-02-28 15:50:54,766 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1517497680557_565955_000002 with *final state: null*

In my case, the *ACL configuration in fair-scheduler.xml was changed*; as a result, the user no longer has rights to submit this application.

In FairScheduler#addApplication() we skip this application: we do not add it to the scheduler's application map, and we send an APP_REJECTED event to move the application to the FAILED state.
{code:java}
if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi) && !queue
    .hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
  String msg = "User " + userUgi.getUserName()
      + " cannot submit applications to queue " + queue.getName()
      + "(requested queuename is " + queueName + ")";
  LOG.info(msg);
  rmContext.getDispatcher().getEventHandler().handle(
      new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED, msg));
  return;
} 
{code}
Then we try to recover the app attempts. When recovering the last app attempt, we check the final state of the attempt and the final state of the application (see RMAppAttemptImpl$AttemptRecoveredTransition#transition()). As noted above, the application final state is null and the last app attempt has no final state either, so we fall back to the current RM app state in the method "isAppInFinalState".
{code:java}
public static boolean isAppInFinalState(RMApp rmApp) {
  RMAppState appState = ((RMAppImpl) rmApp).getRecoveredFinalState();
  if (appState == null) {
    appState = rmApp.getState();
  }
  return appState == RMAppState.FAILED || appState == RMAppState.FINISHED
      || appState == RMAppState.KILLED;
}
{code}
At this point, the *current state of the application is NEW because the APP_REJECTED event has not been processed yet*, as described by Gergo Repas. *This leads to the wrong decision to recover the attempt*. We then try to get the user of the application in FairScheduler#addApplicationAttempt and get an NPE because the application is not found in the scheduler.
{code:java}
SchedulerApplication<FSAppAttempt> application = applications.get(
    applicationAttemptId.getApplicationId());
String user = application.getUser(); // NPE: application is null
FSLeafQueue queue = (FSLeafQueue) application.getQueue();
{code}
{noformat}
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:740)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1327)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1100)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1046)
{noformat}
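To make the ordering problem easier to follow, here is a minimal sketch of it with toy classes instead of the real YARN types. ToyApp, pendingEvents and schedulerApps are hypothetical stand-ins for RMAppImpl, the async dispatcher queue and the scheduler's application map; only the final-state check mirrors isAppInFinalState.
{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of the recovery race; none of these classes are the real YARN ones.
class RecoveryRaceSketch {
  enum AppState { NEW, ACCEPTED, FAILED, FINISHED, KILLED }

  // Stand-in for RMAppImpl: the current state plus the final state read from the state store.
  static class ToyApp {
    AppState state = AppState.NEW;
    AppState recoveredFinalState = null; // null: the app had not finished before the restart
  }

  // Same decision as isAppInFinalState, expressed on the toy types.
  static boolean isAppInFinalState(ToyApp app) {
    AppState s = app.recoveredFinalState != null ? app.recoveredFinalState : app.state;
    return s == AppState.FAILED || s == AppState.FINISHED || s == AppState.KILLED;
  }

  // Stand-in for the async dispatcher: events are queued, not handled inline.
  static final Queue<Runnable> pendingEvents = new ArrayDeque<>();
  // Stand-in for the scheduler's application map (application id -> user).
  static final Map<String, String> schedulerApps = new HashMap<>();

  // Mirrors the rejected branch of addApplication: nothing is put into the map,
  // and the rejection is only queued for later.
  static void addApplicationRejected(ToyApp app) {
    pendingEvents.add(() -> app.state = AppState.FAILED);
  }

  // True if attempt recovery would go on to call into the scheduler.
  static boolean wouldRecoverAttempt(ToyApp app) {
    return !isAppInFinalState(app);
  }

  public static void main(String[] args) {
    ToyApp app = new ToyApp();
    addApplicationRejected(app);

    // The rejection is still queued, so the app does not look final yet...
    System.out.println("recover attempt before drain: " + wouldRecoverAttempt(app));
    // ...but the scheduler has no entry for it, which is where the null comes from.
    System.out.println("scheduler knows app: " + schedulerApps.containsKey("app_1"));

    // Once the queued events are handled, the same check gives a different answer.
    while (!pendingEvents.isEmpty()) {
      pendingEvents.poll().run();
    }
    System.out.println("recover attempt after drain: " + wouldRecoverAttempt(app));
  }
}
{code}
The sketch only shows why the check is taken too early; it does not model how the real dispatcher schedules its events.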

 

*Ideally, we should process the APP_REJECTED event before we try to recover the attempts.* So far, I have not found an easy way to do that.

*We can check whether the application is null.* If it is, we skip this attempt, the same way CapacityScheduler does and as was proposed in YARN-2025.

Perhaps we should open a new ticket for this.
{code:java}
SchedulerApplication<FSAppAttempt> application = applications.get(
    applicationAttemptId.getApplicationId());
if (application == null) {
  LOG.warn("Application " + applicationAttemptId.getApplicationId() +
      " cannot be found in scheduler.");
  return;
}
String user = application.getUser();
{code}
With this check, the RM no longer fails, but we get an InvalidStateTransitonException because the APP_REJECTED event is processed too late.
{noformat}
2018-02-28 16:00:24,847 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: *Invalid event: APP_REJECTED at ACCEPTED.*
{noformat}
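The way such an exception arises can be sketched with a toy transition table. This is only an illustration under simplifying assumptions: TransitionTableSketch and its arcs are hypothetical, the real RMAppImpl uses StateMachineFactory with many more states and events, and the intermediate FINAL_SAVING state is collapsed here.
{code:java}
import java.util.EnumMap;
import java.util.Map;

// Toy transition table; not the real RMAppImpl state machine.
class TransitionTableSketch {
  enum State { NEW, ACCEPTED, FAILED }
  enum Event { RECOVER, APP_REJECTED, ATTEMPT_FAILED }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.NEW;

  // Registers one arc: in state 'from', event 'on' moves the machine to 'to'.
  TransitionTableSketch addTransition(State from, State to, Event on) {
    table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    return this;
  }

  State getState() {
    return current;
  }

  // Any event without a registered arc from the current state is rejected.
  void handle(Event event) {
    Map<Event, State> arcs = table.get(current);
    State next = (arcs == null) ? null : arcs.get(event);
    if (next == null) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    current = next;
  }

  public static void main(String[] args) {
    // Without an APP_REJECTED arc from ACCEPTED, a late rejection is invalid.
    TransitionTableSketch app = new TransitionTableSketch()
        .addTransition(State.NEW, State.ACCEPTED, Event.RECOVER);
    app.handle(Event.RECOVER);
    try {
      app.handle(Event.APP_REJECTED);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }

    // With the extra arc the rejection is accepted, but a later attempt event
    // still has no arc from FAILED.
    TransitionTableSketch patched = new TransitionTableSketch()
        .addTransition(State.NEW, State.ACCEPTED, Event.RECOVER)
        .addTransition(State.ACCEPTED, State.FAILED, Event.APP_REJECTED);
    patched.handle(Event.RECOVER);
    patched.handle(Event.APP_REJECTED);
    try {
      patched.handle(Event.ATTEMPT_FAILED);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}
The point of the sketch is that adding a single arc only moves the problem: every event that can still arrive afterwards needs an arc from the new state as well.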
If we also add a transition for APP_REJECTED from the ACCEPTED state to FAILED (through FINAL_SAVING) to the RMAppImpl StateMachineFactory
{code:java}
.addTransition(RMAppState.ACCEPTED, RMAppState.FINAL_SAVING,
    RMAppEventType.APP_REJECTED,
    new FinalSavingTransition(new AppRejectedTransition(),
        RMAppState.FAILED))
{code}
the application fails correctly, but we hit the same kind of problem with the attempt:

2018-03-01 16:26:23,899 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: *Invalid event: ATTEMPT_FAILED at FAILED*

Perhaps we can kill this application instead, to avoid the exceptions caused by invalid events.
{code:java}
if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi) && !queue
    .hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
  String msg;
  if (isAppRecovering) {
    msg = "Application " + applicationId +
        " will be killed since ACL configuration was changed and user " +
        userUgi.getUserName() + " no longer has rights to submit applications to queue " +
        queue.getName();
    rmContext.getDispatcher().getEventHandler().handle(
        new RMAppEvent(applicationId, RMAppEventType.KILL, msg));
  } else {
    msg = "User " + userUgi.getUserName()
        + " cannot submit applications to queue " + queue.getName()
        + "(requested queuename is " + queueName + ")";
    rmContext.getDispatcher().getEventHandler().handle(
        new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED, msg));
  }
  LOG.info(msg);
  return;
}
{code}
Thanks for any comments.



> RM crashes with NPE during recovering if ACL configuration was changed
> ----------------------------------------------------------------------
>
>                 Key: YARN-7998
>                 URL: https://issues.apache.org/jira/browse/YARN-7998
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Oleksandr Shevchenko
>            Priority: Major
>         Attachments: YARN-7998.000.patch
>
>
> RM crashes with an NPE during failover when the ACL configuration was changed so that the user no longer has rights to submit an application to a queue.
> Scenario:
>  # Submit an application
>  # Change the ACL configuration for the queue that accepted the application so that the owner of the application no longer has rights to submit this application.
>  # Restart RM.
> As a result, we get NPE:
> 2018-02-27 18:14:00,968 INFO org.apache.hadoop.service.AbstractService: Service ResourceManager failed in state STARTED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:738)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1286)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
> 	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1098)
> 	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385


