Posted to yarn-issues@hadoop.apache.org by "Steven Rand (JIRA)" <ji...@apache.org> on 2018/10/18 04:44:00 UTC

[jira] [Created] (YARN-8903) when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node

Steven Rand created YARN-8903:
---------------------------------

             Summary: when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node
                 Key: YARN-8903
                 URL: https://issues.apache.org/jira/browse/YARN-8903
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 3.1.1
            Reporter: Steven Rand


We sometimes experience an issue in which a single application, usually a Spark job, fills up the local dir(s) on at least one node in a YARN cluster past the utilization threshold at which the disk health checker marks that node unhealthy.

When this happens, the impact can be large depending on what else is running on that node, since all containers on the node are lost. Sometimes little else is running and the damage is contained, but other times we lose AM containers from other apps and/or non-AM containers running long tasks.

I think it would be helpful to add an option (default false) whereby, when a node is about to become unhealthy due to full local disk(s), it instead identifies the application using the most local disk space on that node and kills that application. (Roughly analogous to how the Linux OOM killer picks one process to kill rather than letting the machine crash.)
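To make the idea concrete, here is a minimal sketch of the selection step. All class and method names are hypothetical, not existing YARN APIs; it assumes the NM can already attribute local-dir bytes to each running application:

```java
import java.util.Map;

/**
 * Hypothetical sketch of the proposed selection logic. When a node's
 * local-dir usage crosses the unhealthy threshold, pick the single
 * application consuming the most local disk space on that node,
 * rather than marking the node unhealthy and losing every container.
 */
public class DiskHogSelector {

  /**
   * Returns the application ID using the most local disk space on this
   * node, or null if no applications have local data here.
   */
  public static String selectAppToKill(Map<String, Long> localBytesByApp) {
    String victim = null;
    long maxBytes = -1L;
    for (Map.Entry<String, Long> e : localBytesByApp.entrySet()) {
      if (e.getValue() > maxBytes) {
        maxBytes = e.getValue();
        victim = e.getKey();
      }
    }
    return victim;
  }

  public static void main(String[] args) {
    // Illustrative per-app local disk usage on one node.
    Map<String, Long> usage = Map.of(
        "application_1_0001", 5L * 1024 * 1024 * 1024,    // 5 GiB
        "application_1_0002", 120L * 1024 * 1024 * 1024,  // 120 GiB spill-heavy Spark job
        "application_1_0003", 1L * 1024 * 1024 * 1024);   // 1 GiB
    System.out.println(selectAppToKill(usage)); // prints application_1_0002
  }
}
```

The NM would then ask the RM to kill the selected application, the same way other forced kills are issued today; the hard part is the bookkeeping that produces the per-app usage map, not the selection itself.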

The benefit is that only one application is impacted, and no other application loses any containers. This prevents one user's poorly written code that shuffles/spills huge amounts of data from negatively impacting other users.

The downside is that we'd kill the entire application, not just the task(s) responsible for the local disk usage. I believe killing the whole application is necessary, rather than targeting the specific container(s): attributing local disk usage (e.g., shuffle data) to individual containers would require more visibility into the internal state of the aux services that manage shuffling than YARN has, as far as I understand.

If this seems reasonable, I can work on the implementation.
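For illustration, the option could sit alongside the existing disk health checker settings in yarn-site.xml. The property name below is hypothetical; no such option exists in YARN today:

```xml
<!-- Hypothetical property; name and placement are a sketch only. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.kill-heaviest-app.enabled</name>
  <value>false</value>
  <description>
    If true, when local disk usage would make this node unhealthy, kill the
    application using the most local disk space on the node instead of
    marking the node unhealthy and losing all of its containers.
  </description>
</property>
```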



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org