Posted to yarn-issues@hadoop.apache.org by "lujie (JIRA)" <ji...@apache.org> on 2019/01/25 13:30:00 UTC
[jira] [Updated] (YARN-9238) We get a wrong attempt by an
appAttemptId when AM crash at some point
[ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lujie updated YARN-9238:
------------------------
Summary: We get a wrong attempt by an appAttemptId when AM crash at some point (was: An Data Race can make we get a wrong attempt by an appAttemptId)
> We get a wrong attempt by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
> Key: YARN-9238
> URL: https://issues.apache.org/jira/browse/YARN-9238
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Priority: Critical
>
> We have found a data race that can lead to an odd situation.
> See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
> {code:java}
> // Allocate OPPORTUNISTIC containers.
> 171. SchedulerApplicationAttempt appAttempt =
> 172.     ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.         .getApplicationAttempt(appAttemptId);
> 174.
> 175. OpportunisticContainerContext oppCtx =
> 176.     appAttempt.getOpportunisticContainerContext();
> 177. oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> If we crash the current AM (its attempt ID is appattempt_0) just before line 171, lines 171~173 still execute and look up the appAttempt by appattempt_0, so the returned appAttempt should represent the current (crashed) AM. Instead, we found that the returned appAttempt represents the new AM, whose attempt ID is appattempt_1. The appAttempt for the new AM has not initialized its oppCtx yet, so an NPE happens at line 177.
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> We have found why a lookup with the old appattempt_0 returns the appAttempt that represents the new AM. Below is the body of getApplicationAttempt, which is called at line 173:
> {code:java}
> 399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
> 400.   SchedulerApplication<T> app = applications.get(
> 401.       applicationAttemptId.getApplicationId());
> 402.   return app == null ? null : app.getCurrentAppAttempt();
> 403. }
> {code}
> When the old AM crashes, the app's CurrentAppAttempt is set to the new appAttempt that represents the new AM, so line 402 returns the new appAttempt even though the old attempt ID was passed in.
> We should add a check that the returned appAttempt has the same ID as the given ID.
> A patch is coming soon!
>
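The lookup behavior and the proposed check can be illustrated with a minimal, self-contained sketch. This is not the Hadoop code and not the actual patch: the class AttemptLookupSketch, its nested Attempt and App types, the plain String IDs, and the method getApplicationAttemptChecked are all hypothetical stand-ins chosen for illustration. It assumes only what the report states, namely that the lookup is keyed by application ID and returns whatever attempt is current.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: simplified stand-ins, NOT the real Hadoop scheduler classes.
class AttemptLookupSketch {

  /** Hypothetical stand-in for a scheduler application attempt. */
  static final class Attempt {
    final String attemptId;
    Attempt(String attemptId) { this.attemptId = attemptId; }
  }

  /** Hypothetical stand-in for a scheduler application; tracks only the current attempt. */
  static final class App {
    volatile Attempt currentAttempt;
  }

  /** Applications keyed by application ID, as in the lookup quoted in the report. */
  static final Map<String, App> applications = new ConcurrentHashMap<>();

  /** Mirrors the reported behavior: the attempt ID argument is never compared. */
  static Attempt getApplicationAttempt(String appId, String attemptId) {
    App app = applications.get(appId);
    return app == null ? null : app.currentAttempt;
  }

  /** Sketch of the proposed check: return the attempt only if its ID matches the request. */
  static Attempt getApplicationAttemptChecked(String appId, String attemptId) {
    App app = applications.get(appId);
    if (app == null) {
      return null;
    }
    Attempt current = app.currentAttempt; // read once so the check and the return agree
    return (current != null && current.attemptId.equals(attemptId)) ? current : null;
  }

  public static void main(String[] args) {
    App app = new App();
    app.currentAttempt = new Attempt("appattempt_0");
    applications.put("app_1", app);

    // The old AM crashes and the current attempt is replaced by the new one.
    app.currentAttempt = new Attempt("appattempt_1");

    // A late request from the old AM still carries appattempt_0.
    Attempt unchecked = getApplicationAttempt("app_1", "appattempt_0");
    Attempt checked = getApplicationAttemptChecked("app_1", "appattempt_0");

    System.out.println(unchecked.attemptId); // prints "appattempt_1"
    System.out.println(checked);             // prints "null"
  }
}
```

With a check like this, the caller receives null for a stale attempt ID instead of an attempt that belongs to the new AM, so the caller would still need to handle the null case explicitly (for example by rejecting the stale request) rather than dereferencing the result.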