Posted to yarn-issues@hadoop.apache.org by "Gour Saha (JIRA)" <ji...@apache.org> on 2018/07/27 01:33:00 UTC

[jira] [Commented] (YARN-8579) New AM attempt could not retrieve previous attempt component data

    [ https://issues.apache.org/jira/browse/YARN-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559121#comment-16559121 ] 

Gour Saha commented on YARN-8579:
---------------------------------

I investigated this issue and found that the root cause is missing NM tokens for the containers that are passed to the AM after registration via the onContainersReceivedFromPreviousAttempts callback. These tokens are required following the change made in YARN-6168. The exception seen in the AM log is shown below:

{code}
2018-07-26 23:22:31,373 [pool-5-thread-4] ERROR instance.ComponentInstance - [COMPINSTANCE httpd-proxy-0 : container_e15_1532637883791_0001_01_000004] Failed to get container status on ctr-e138-1518143905142-412155-01-000005.hwx.site:25454, will try again
org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for ctr-e138-1518143905142-412155-01-000005.hwx.site:25454
	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:262)
	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:252)
	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:137)
	at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:323)
	at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:596)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
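
To illustrate the failure mode, here is a minimal, self-contained Java sketch. It deliberately does not use the Hadoop classes: NodeToken, RecoveredContainer, TokenCache, recover and the host name are hypothetical stand-ins chosen for illustration, and this is not the patch for this issue. It only models the idea that a status query for a container from a previous attempt fails until a token for that container's node has been cached by the AM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of the missing-token failure; none of these types are Hadoop classes. */
public class NmTokenSketch {

    /** Hypothetical stand-in for an NM token: the node it is valid for plus an opaque secret. */
    public static final class NodeToken {
        final String nodeAddress;
        final String secret;

        public NodeToken(String nodeAddress, String secret) {
            this.nodeAddress = nodeAddress;
            this.secret = secret;
        }
    }

    /** Hypothetical stand-in for a container handed over from a previous AM attempt. */
    public static final class RecoveredContainer {
        final String containerId;
        final String nodeAddress;

        public RecoveredContainer(String containerId, String nodeAddress) {
            this.containerId = containerId;
            this.nodeAddress = nodeAddress;
        }
    }

    /** Per-AM token cache keyed by NodeManager address. */
    public static final class TokenCache {
        private final Map<String, String> secretsByNode = new ConcurrentHashMap<>();

        public void put(NodeToken token) {
            secretsByNode.put(token.nodeAddress, token.secret);
        }

        public boolean hasToken(String nodeAddress) {
            return secretsByNode.containsKey(nodeAddress);
        }
    }

    /** Mimics the trace above: without a cached token for the node, the query is rejected. */
    public static String getContainerStatus(TokenCache cache, RecoveredContainer container) {
        if (!cache.hasToken(container.nodeAddress)) {
            throw new IllegalStateException("No NMToken sent for " + container.nodeAddress);
        }
        return "RUNNING"; // placeholder; a real AM would contact the NodeManager here
    }

    /** Caches the tokens that accompany recovered containers, then queries each container. */
    public static List<String> recover(TokenCache cache,
                                       List<RecoveredContainer> containers,
                                       List<NodeToken> tokens) {
        for (NodeToken token : tokens) {
            cache.put(token);
        }
        List<String> statuses = new ArrayList<>();
        for (RecoveredContainer container : containers) {
            statuses.add(getContainerStatus(cache, container));
        }
        return statuses;
    }

    public static void main(String[] args) {
        TokenCache cache = new TokenCache();
        RecoveredContainer c = new RecoveredContainer("container_01_000004", "nm-host.example:25454");
        try {
            getContainerStatus(cache, c); // no token cached yet, so this throws
        } catch (IllegalStateException e) {
            System.out.println("Without token: " + e.getMessage());
        }
        List<NodeToken> tokens =
            Collections.singletonList(new NodeToken(c.nodeAddress, "opaque-secret"));
        System.out.println("With token: " + recover(cache, Collections.singletonList(c), tokens));
    }
}
```

In this model the first query fails with the same kind of message as the trace above, and the second succeeds once the token delivered alongside the recovered container has been cached.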

> New AM attempt could not retrieve previous attempt component data
> -----------------------------------------------------------------
>
>                 Key: YARN-8579
>                 URL: https://issues.apache.org/jira/browse/YARN-8579
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Yesha Vora
>            Assignee: Gour Saha
>            Priority: Critical
>
> Steps:
> 1) Launch httpd-docker
> 2) Wait for app to be in STABLE state
> 3) Run validation for app (It takes around 3 mins)
> 4) Stop all ZKs
> 5) Wait 60 sec
> 6) Kill AM
> 7) Wait for 30 sec
> 8) Start all ZKs
> 9) Wait for application to finish
> 10) Validate expected containers of the app
> Expected behavior:
> New attempt of AM should start and docker containers launched by 1st attempt should be recovered by new attempt.
> Actual behavior:
> New AM attempt starts. It cannot recover the 1st attempt's docker containers. It cannot read component details from ZK.
> Thus, it starts a new attempt for all containers.
> {code}
> 2018-07-19 22:42:47,595 [main] INFO  service.ServiceScheduler - Registering appattempt_1531977563978_0015_000002, fault-test-zkrm-httpd-docker into registry
> 2018-07-19 22:42:47,611 [main] INFO  service.ServiceScheduler - Received 1 containers from previous attempt.
> 2018-07-19 22:42:47,642 [main] INFO  service.ServiceScheduler - Could not read component paths: `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': No such file or directory: KeeperErrorCode = NoNode for /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components
> 2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Handling container_e08_1531977563978_0015_01_000003 from previous attempt
> 2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Record not found in registry for container container_e08_1531977563978_0015_01_000003 from previous attempt, releasing
> 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO  impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019
> 2018-07-19 22:42:47,651 [main] INFO  service.ServiceScheduler - Triggering initial evaluation of component httpd
> 2018-07-19 22:42:47,652 [main] INFO  component.Component - [INIT COMPONENT httpd]: 2 instances.
> 2018-07-19 22:42:47,652 [main] INFO  component.Component - [COMPONENT httpd] Requesting for 2 container(s){code}


