Posted to yarn-issues@hadoop.apache.org by "Rohith Sharma K S (JIRA)" <ji...@apache.org> on 2016/05/10 04:39:12 UTC

[jira] [Commented] (YARN-5063) Fail to launch AM continuously on a lost NM

    [ https://issues.apache.org/jira/browse/YARN-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277602#comment-15277602 ] 

Rohith Sharma K S commented on YARN-5063:
-----------------------------------------

bq. If an NM node shuts down, the RM will not mark it as LOST until the liveness monitor finds that it has timed out. Before that, however, the RM might continuously allocate the AM on that NM.
Is the scheduler running in async scheduling mode? Normally, allocation happens only when a node heartbeat is received. If the node is shut down, it does not send heartbeats. So I am wondering how the RM can allocate a container to the same node after the NM has shut down, provided async scheduling mode is not enabled. Am I missing any critical point here?
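
To make the question concrete, here is a minimal sketch of heartbeat-driven allocation. It is only a model under stated assumptions: the class and method names (HeartbeatSchedulerSketch, onNodeHeartbeat, and so on) are hypothetical and are not the actual RM scheduler classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of heartbeat-driven allocation; not the real YARN scheduler.
class HeartbeatSchedulerSketch {
    private final Queue<String> pendingRequests = new ArrayDeque<>();
    private final List<String> allocations = new ArrayList<>();

    // Queue a container request (for example, an AM container).
    void submit(String request) {
        pendingRequests.add(request);
    }

    // Invoked only when a node reports in. A node that is shut down never
    // sends a heartbeat, so this path never allocates on it.
    void onNodeHeartbeat(String nodeId) {
        String request = pendingRequests.poll();
        if (request != null) {
            allocations.add(request + "@" + nodeId);
        }
    }

    List<String> allocations() {
        return allocations;
    }

    public static void main(String[] args) {
        HeartbeatSchedulerSketch scheduler = new HeartbeatSchedulerSketch();
        scheduler.submit("am-attempt-1");
        // nm-1 is down and sends nothing; only nm-2 heartbeats.
        scheduler.onNodeHeartbeat("nm-2");
        System.out.println(scheduler.allocations());
    }
}
```

In this model the request can only land on a node that has just sent a heartbeat, which is why repeated allocation on a dead NM would seem to need either async scheduling or a heartbeat received shortly before the shutdown.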

bq. we could add the NM to the AM blacklist if the RM fails to launch it.
What is the reason for the launch failure? YARN-2005 provides support for blacklisting nodes when scheduling AMs, but it has a design-level issue that would cause problems such as YARN-4685.
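
As a rough illustration of the proposal quoted above (blacklist the node once an AM launch on it fails), a sketch could look like the following. All names here (AmBlacklistSketch, onAmLaunchFailure, pickNodeForAm) are hypothetical; this is not the YARN-2005 implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of a per-application AM blacklist; not YARN code.
class AmBlacklistSketch {
    private final Set<String> blacklistedNodes = new HashSet<>();

    // Record a node on which launching the AM failed.
    void onAmLaunchFailure(String nodeId) {
        blacklistedNodes.add(nodeId);
    }

    // Return the first candidate node that is not blacklisted, if any.
    Optional<String> pickNodeForAm(List<String> candidateNodes) {
        return candidateNodes.stream()
                .filter(node -> !blacklistedNodes.contains(node))
                .findFirst();
    }
}
```

Note that if every candidate node ends up blacklisted, pickNodeForAm returns an empty result and the AM cannot be placed at all, so any real design needs a rule for when to stop or relax blacklisting.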

> Fail to launch AM continuously on a lost NM
> -------------------------------------------
>
>                 Key: YARN-5063
>                 URL: https://issues.apache.org/jira/browse/YARN-5063
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> If an NM node shuts down, the RM will not mark it as LOST until the liveness monitor finds that it has timed out. Before that, however, the RM might continuously allocate the AM on that NM.
> We found this case in our cluster: the RM continuously allocated the same AM on a lost NM before the RM found it lost, and AMLauncher always failed because it could not connect to the lost NM. To solve the problem, we could add the NM to the AM blacklist if the RM fails to launch it.


