Posted to mapreduce-issues@hadoop.apache.org by "Haibo Chen (JIRA)" <ji...@apache.org> on 2017/03/01 23:53:45 UTC

[jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891327#comment-15891327 ] 

Haibo Chen edited comment on MAPREDUCE-6834 at 3/1/17 11:52 PM:
----------------------------------------------------------------

Thanks for the clarification, [~jlowe]. We have not made changes to preserve containers in MR. Chasing the code in more detail, I came to a similar conclusion as https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003   MR relies on the YARN RM to get the NMTokens needed to launch containers with NMs. Given the code today, it is possible that a null NMToken is sent to MR, which contradicts the javadoc in SchedulerApplicationAttempt.java here
{code:java}
  // Create container token and NMToken altogether, if either of them fails for
  // some reason like DNS unavailable, do not return this container and keep it
  // in the newlyAllocatedContainers waiting to be refetched.
  public synchronized ContainersAndNMTokensAllocation pullNewlyAllocatedContainersAndNMTokens() {...}
{code}
I believe this is a duplicate of YARN-3112, so I am going to close this jira as a duplicate. Feel free to reopen it if you disagree.
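To make the failure mode concrete, here is a minimal, self-contained sketch (plain Java with hypothetical names, not the actual Hadoop classes): the AM keeps a per-node NMToken cache, and launching a container on a node whose token was never received fails with the same kind of "No NMToken sent" error reported in this jira.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model only, not Hadoop's NMTokenCache: a per-node token map
// that the AM consults before launching a container on a NodeManager.
public class NMTokenCacheSketch {
    private final Map<String, String> nodeToToken = new ConcurrentHashMap<>();

    // Called when the RM sends an NMToken for a node (e.g. in an allocate response)
    public void setToken(String nodeAddr, String token) {
        nodeToToken.put(nodeAddr, token);
    }

    // Called at container-launch time; mirrors the InvalidToken in the stack trace
    public String getTokenOrThrow(String nodeAddr) {
        String token = nodeToToken.get(nodeAddr);
        if (token == null) {
            throw new IllegalStateException("No NMToken sent for " + nodeAddr);
        }
        return token;
    }
}
```

If the RM never (re)sends the token for a node, the lookup at launch time has nothing to find, which is exactly the situation after AM recovery described below.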




> MR application fails with "No NMToken sent" exception after MRAppMaster recovery
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6834
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager, yarn
>    Affects Versions: 2.7.0
>         Environment: Centos 7
>            Reporter: Aleksandr Balitsky
>            Assignee: Aleksandr Balitsky
>            Priority: Critical
>         Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit an MR application (for example, the PI app with 50 containers)
> 2) Find the MRAppMaster process id for the application
> 3) Kill the MRAppMaster with {{kill -9}}
> *Expected:* The ResourceManager launches a new MRAppMaster container and MRAppAttempt, and the application finishes correctly
> *Actual:* After the new MRAppMaster and MRAppAttempt are launched, the application fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for node1:43037
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted to RMCommunicator in the RegisterApplicationMasterResponse (via the getNMTokensFromPreviousAttempts method). However, we do not handle these tokens in the RMCommunicator.register method. The RM does not transmit these tokens again in responses to subsequent allocate requests, so the tokens never make it into NMTokenCache. Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed successfully after MRAppMaster recovery
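The missing step described in the problem analysis above can be sketched as follows (plain Java with hypothetical names, not Hadoop's actual API): on register, the recovered AM would copy the NMTokens from previous attempts out of the register response into its token cache, so that later container launches on those nodes can find a token.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Hadoop's real classes: models what
// RMCommunicator.register could do with the tokens returned by
// RegisterApplicationMasterResponse.getNMTokensFromPreviousAttempts().
public class RegisterSketch {
    // nodeAddress -> token string, standing in for NMTokenCache
    static final Map<String, String> tokenCache = new ConcurrentHashMap<>();

    // Each entry pairs a node address with its NMToken, standing in for the
    // per-node NMToken records returned for the previous attempt.
    static void handleRegisterResponse(List<SimpleEntry<String, String>> nmTokensFromPreviousAttempts) {
        for (SimpleEntry<String, String> t : nmTokensFromPreviousAttempts) {
            // Cache the token so later container launches on this node succeed
            tokenCache.put(t.getKey(), t.getValue());
        }
    }
}
```

Under this reading, the fix belongs on the AM side of register handling; whether to do it there or have the RM resend the tokens is the same trade-off discussed in YARN-3112.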



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org