Posted to yarn-issues@hadoop.apache.org by "Maysam Yabandeh (JIRA)" <ji...@apache.org> on 2015/09/22 07:03:04 UTC
[jira] [Commented] (YARN-4011) Jobs fail since nm-local-dir not
cleaned up when rogue job fills up disk
[ https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901946#comment-14901946 ]
Maysam Yabandeh commented on YARN-4011:
---------------------------------------
We hit this problem quite often in our ad hoc cluster and are thinking of implementing some basic checkers to make such misbehaving jobs fail fast.
Until there is a proper solution in YARN, could we put a MapReduce-specific solution in place to protect the cluster from rogue MapReduce tasks? The MapReduce task could check its BYTES_WRITTEN counter and fail fast if the value exceeds a configured limit. It is true that bytes written can be larger than the disk space actually in use, but detecting a rogue task does not require an exact value: a very large number of bytes written to local disk is a good indication that the task is misbehaving.
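To make the idea concrete, here is a rough sketch of the check in plain Java. It is only an illustration, not a patch: the class name LocalWriteLimitChecker and its constructor arguments are hypothetical, and the LongSupplier is a stand-in for however the task would read its local-filesystem BYTES_WRITTEN counter. How the limit is configured and where the task would call check() are left open.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical fail-fast check (not part of Hadoop): compares a
 * "bytes written" reading against a configured limit.
 */
class LocalWriteLimitChecker {
    // Configured limit in bytes; a value <= 0 disables the check.
    private final long limitBytes;
    // Assumption: stands in for reading the task's BYTES_WRITTEN counter.
    private final LongSupplier bytesWritten;

    LocalWriteLimitChecker(long limitBytes, LongSupplier bytesWritten) {
        this.limitBytes = limitBytes;
        this.bytesWritten = bytesWritten;
    }

    /** Throws IllegalStateException once the reading exceeds the limit. */
    void check() {
        if (limitBytes <= 0) {
            return; // check disabled
        }
        long written = bytesWritten.getAsLong();
        if (written > limitBytes) {
            throw new IllegalStateException(
                "Local bytes written " + written
                    + " exceeds configured limit " + limitBytes);
        }
    }

    public static void main(String[] args) {
        long[] counter = {0L}; // simulated counter value
        LocalWriteLimitChecker checker =
            new LocalWriteLimitChecker(1024L, () -> counter[0]);

        counter[0] = 512L;
        checker.check(); // under the limit, nothing happens

        counter[0] = 4096L;
        try {
            checker.check();
        } catch (IllegalStateException e) {
            // A real task would fail itself here instead of printing.
            System.out.println("Failing task: " + e.getMessage());
        }
    }
}
```

The check is deliberately coarse. Since the counter overestimates the disk space in use, the limit would have to be set well above what a healthy task writes.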
Thoughts?
> Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
> ------------------------------------------------------------------------
>
> Key: YARN-4011
> URL: https://issues.apache.org/jira/browse/YARN-4011
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.4.0
> Reporter: Ashwin Shankar
>
> We observed jobs failing because tasks couldn't launch on nodes due to "java.io.IOException No space left on device".
> On digging further, we found a rogue job which had filled up the disk.
> Specifically, it wrote a lot of map spills (like attempt_1432082376223_461647_m_000421_0_spill_10000.out) to nm-local-dir, causing the disk to fill up. It then failed or got killed, but these files in nm-local-dir were not cleaned up.
> So the disk remained full, causing subsequent jobs to fail.
> This JIRA was created to address why files under nm-local-dir don't get cleaned up when a job fails after filling up the disk.