Posted to yarn-issues@hadoop.apache.org by "Lee young gon (Jira)" <ji...@apache.org> on 2021/09/24 04:56:00 UTC
[jira] [Created] (YARN-10969) After RM fail-over, getContainerStatus fails from ApplicationMaster to NodeManager

Lee young gon created YARN-10969:
------------------------------------

             Summary: After RM fail-over, getContainerStatus fails from ApplicationMaster to NodeManager
                 Key: YARN-10969
                 URL: https://issues.apache.org/jira/browse/YARN-10969
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.1.2
            Reporter: Lee young gon


When the artifact type in a yarn-service spec is docker, the AM periodically calls getContainerStatus through the NMClient.

After an RM fail-over, getContainerStatus starts failing once a specific period (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2) has elapsed.
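As a rough illustration of that window, the sketch below simply doubles the rolling interval. The 86400-second interval is an assumed example value (check the setting in your own yarn-site.xml), and the class and method names are hypothetical, not part of Hadoop.

```java
import java.time.Duration;

/** Hypothetical helper: estimates how long a re-issued NMToken stays usable. */
final class NmTokenValidityWindow {
    // Property name quoted in this report.
    static final String ROLLING_INTERVAL_KEY =
        "yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs";

    /** Upper bound described in this report: two rolling intervals. */
    static long maxValiditySecs(long rollingIntervalSecs) {
        return Math.multiplyExact(rollingIntervalSecs, 2L);
    }

    public static void main(String[] args) {
        long intervalSecs = 86_400L; // assumed example value (one day)
        Duration window = Duration.ofSeconds(maxValiditySecs(intervalSecs));
        System.out.println(ROLLING_INTERVAL_KEY + "=" + intervalSecs
            + " -> re-issued token usable for at most " + window.toHours() + " hours");
    }
}
```

With a one-day interval, the failure would therefore be expected to show up no later than about two days after the fail-over.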

The following error then appears in the AM log:
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance - [COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004] Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
    at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source)
    at org.apache.hadoo
{code}
The overall flow is as follows.
 # The AM starts.
 # The AM requests containers.
 # The RM assigns a container.
 ## The RM creates a token for each assigned NM from the NMToken master key and delivers the tokens to the AM.
 ## The NMToken master key is rolled periodically (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs) on each NM. Apart from a timing difference, the NM holds the same value as the RM.
 # When the NM is assigned a container, it stores the NMToken master key in use at that point (the same master key the RM used) in oldMasterKeys (a hashmap), keyed by ApplicationAttemptId.
 # After that, requests from the AM to the NM (getContainerStatus) are made with the issued token.
 ** Even if the master key is rolled, the request succeeds because the key is kept in the NM's oldMasterKeys (stored in the NMStateStore).
 # The problem arises when the AM loses that token for any reason (e.g. RM fail-over, AM restart).
 # For example, when the AM restarts, it uses a token created with the NMToken master key current at that point, and that token is only effective for a specific period (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2).
 ** If the NM's oldMasterKeys already holds an old master key for the ApplicationAttemptId, tokens created with the currentMasterKey or preliminaryMasterKey are valid, but they are no longer valid after subsequent rolls.

In other words, whenever the AM is re-issued a token, for whatever reason, that token is only valid for a limited time.
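The flow above can be sketched with a small toy model. This is not Hadoop code: the class, its methods, the integer key ids and the attempt id are all hypothetical, and the lookup rule (accept a token whose key is the current key, the previous key, or the key remembered for the application attempt) is only a simplified reading of the behaviour described in this report.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the NM-side key lookup described in this report; not Hadoop code. */
final class NmTokenKeyModel {
    private int currentKeyId;
    private Integer previousKeyId; // null until the first roll
    // Key remembered per application attempt (stands in for oldMasterKeys).
    private final Map<String, Integer> oldKeyByAttempt = new HashMap<>();

    NmTokenKeyModel(int initialKeyId) {
        this.currentKeyId = initialKeyId;
    }

    /** Rolls the master key: current becomes previous, a new id becomes current. */
    void roll() {
        previousKeyId = currentKeyId;
        currentKeyId++;
    }

    int currentKeyId() {
        return currentKeyId;
    }

    /** Container start: remember the key the attempt's token was created with. */
    void rememberKeyForAttempt(String attemptId, int tokenKeyId) {
        oldKeyByAttempt.put(attemptId, tokenKeyId);
    }

    /** Accepts a token whose key is the current, previous, or remembered key. */
    boolean accepts(String attemptId, int tokenKeyId) {
        if (tokenKeyId == currentKeyId) {
            return true;
        }
        if (previousKeyId != null && tokenKeyId == previousKeyId) {
            return true;
        }
        Integer old = oldKeyByAttempt.get(attemptId);
        return old != null && old == tokenKeyId;
    }

    public static void main(String[] args) {
        NmTokenKeyModel nm = new NmTokenKeyModel(1);
        String attempt = "appattempt_example"; // hypothetical id

        // Token issued at container start; the NM remembers its key.
        int originalToken = nm.currentKeyId();
        nm.rememberKeyForAttempt(attempt, originalToken);
        nm.roll();
        nm.roll();
        nm.roll();
        System.out.println("original token after 3 rolls: "
            + nm.accepts(attempt, originalToken));

        // Token re-issued after a fail-over, created with the then-current key.
        int reissuedToken = nm.currentKeyId();
        nm.roll();
        System.out.println("re-issued token after 1 roll: "
            + nm.accepts(attempt, reissuedToken));
        nm.roll();
        System.out.println("re-issued token after 2 rolls: "
            + nm.accepts(attempt, reissuedToken));
    }
}
```

In this model the original token keeps working across any number of rolls because its key is remembered for the attempt, while the re-issued token stops being accepted after the second roll, which matches the two-interval limit described above.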


