Posted to yarn-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2016/03/11 18:01:39 UTC

[jira] [Commented] (YARN-4790) Per user blacklist node for user specific error for container launch failure.

    [ https://issues.apache.org/jira/browse/YARN-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191217#comment-15191217 ] 

Vinod Kumar Vavilapalli commented on YARN-4790:
-----------------------------------------------

I agree with the problem statement, but not necessarily with the proposal. Please edit the title so that it describes only the problem; we can then figure out what the right solution is.

What we need is to *not* penalize applications for system-related issues. When YARN finds a node with configuration / permission issues, it should itself take action to (a) avoid scheduling on that node, (b) alert administrators, etc.

Implementing heuristics for app / user level blacklisting to work around platform problems should be a last-ditch effort. We did that in Hadoop 1 MapReduce because we didn't have a clear demarcation between app and system failures. That isn't the case with YARN, which is part of the reason why we never implemented heuristics-based per-app blacklisting *in YARN*; we left that completely up to applications.
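
To make the distinction concrete, here is a rough sketch of platform-side handling. It is not YARN code: the NodeHealthTracker class, the FailureKind enum and the failure threshold are all hypothetical names and assumptions used only for illustration. The idea is that a system failure is never charged to the application; instead it counts against the node, and after repeated system failures the node is excluded from scheduling and an alert is recorded for administrators.

```python
from collections import defaultdict
from enum import Enum


class FailureKind(Enum):
    APP = "app"        # caused by the application itself
    SYSTEM = "system"  # caused by the node, e.g. missing user or bad permissions


class NodeHealthTracker:
    """Hypothetical platform-side tracker; illustration only, not a YARN API."""

    def __init__(self, system_failure_threshold=3):
        # Threshold is an assumed tuning knob, not an existing YARN setting.
        self.system_failure_threshold = system_failure_threshold
        self.system_failures = defaultdict(int)
        self.excluded_nodes = set()
        self.alerts = []

    def report_launch_failure(self, node, kind):
        """Record a failure; return True if it should count against the app."""
        if kind is FailureKind.APP:
            return True
        # System failure: charge the node, never the application.
        self.system_failures[node] += 1
        if (self.system_failures[node] >= self.system_failure_threshold
                and node not in self.excluded_nodes):
            self.excluded_nodes.add(node)  # (a) avoid scheduling on that node
            self.alerts.append(            # (b) alert administrators
                f"node {node} excluded after repeated system failures")
        return False

    def can_schedule_on(self, node):
        return node not in self.excluded_nodes
```

With a tracker like this, an application's attempt count only goes up when report_launch_failure returns True, so a misconfigured node cannot exhaust an application's retries.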

> Per user blacklist node for user specific error for container launch failure.
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4790
>                 URL: https://issues.apache.org/jira/browse/YARN-4790
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>            Reporter: Junping Du
>            Assignee: Junping Du
>
> There are some user-specific errors that cause container launch failures. For example,
> when LinuxContainerExecutor is enabled but the user does not exist on some node, container launch fails on that node with the following information:
> {noformat}
> 2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1434045496283_0036_000002 State change from LAUNCHED to FAILED 
> 2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1434045496283_0036 failed 2 times due to AM Container for 
> appattempt_1434045496283_0036_000002 exited with exitCode: -1000 due to: 
> Application application_1434045496283_0036 initialization failed (exitCode=255) with output: User jdu not found 
> {noformat}
> Obviously, this node is not suitable for launching containers for this user's other applications. We now need a per-user blacklist tracking mechanism rather than a per-application one.


