Posted to mapreduce-issues@hadoop.apache.org by "Eli Collins (Commented) (JIRA)" <ji...@apache.org> on 2011/10/24 21:46:33 UTC

[jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134413#comment-13134413 ] 

Eli Collins commented on MAPREDUCE-3121:
----------------------------------------

Will failure detection work similarly to MR1, i.e. there's a periodic local dir checker thread that marks directories as failed? In MR1, a failed attempt by a container to access a local dir does not necessarily result in the local dir being marked as failed, since e.g. a failure that causes a write to a log to fail may not cause the dir check to fail. If the key places that frequently access local disk - logging and writing intermediate data - handle disk failure, then the container can fail fast and we may identify failures that would otherwise go unnoticed.
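
To make the two detection paths concrete, here is a minimal sketch, not the actual TaskTracker or NodeManager code. The class `LocalDirTracker` and its methods are hypothetical names chosen for illustration. `checkAll()` stands in for the periodic checker, and `reportFailure()` is the hook that logging or intermediate-data writers could call when a write fails, so a dir can be marked failed even if the periodic probe would still pass.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: tracks which configured local dirs are still usable.
class LocalDirTracker {
    private final List<Path> dirs;
    private final Set<Path> failed = ConcurrentHashMap.newKeySet();

    LocalDirTracker(List<Path> dirs) {
        this.dirs = new ArrayList<>(dirs);
    }

    // Periodic path: run checkAll() at a fixed delay on the given scheduler.
    void start(ScheduledExecutorService scheduler, long periodMillis) {
        scheduler.scheduleWithFixedDelay(this::checkAll, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Probe every dir that is not already marked failed.
    void checkAll() {
        for (Path dir : dirs) {
            if (!failed.contains(dir) && !probe(dir)) {
                failed.add(dir);
            }
        }
    }

    // Fail-fast path: an I/O caller reports that a write under this dir failed.
    void reportFailure(Path dir, IOException cause) {
        if (dirs.contains(dir)) {
            failed.add(dir);
        }
    }

    // Dirs that are not currently marked failed, in configured order.
    List<Path> goodDirs() {
        List<Path> good = new ArrayList<>();
        for (Path dir : dirs) {
            if (!failed.contains(dir)) {
                good.add(dir);
            }
        }
        return good;
    }

    // A dir passes the probe if a temp file can be created in it and deleted.
    static boolean probe(Path dir) {
        try {
            if (!Files.isDirectory(dir)) {
                return false;
            }
            Path tmp = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(tmp);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }
}
```

In this sketch a failed dir stays failed until the process restarts, which mirrors the MR1 behaviour described in this comment; supporting disks that come back online would need a way to clear entries from the failed set.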

Will there be any dependency on the NM health checker script?  

Can disk failures be considered transient? Do we want to support disks coming back online? In MR1 a disk failure means the disk is blacklisted until TT restart. 

Is the AM treated like a container as well? I.e. it's allowed to run and fail, and the NM will restart it?
 
IIUC the RM doesn't currently consider disks in its allocation policy, therefore the "major percentage" that's needed to offline the NM should take this into account, right? This is similar to the issue in MR1 where we don't throttle slots based on the number of available disks.
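
As a rough illustration of the threshold question, the check could look like the sketch below. `DiskHealthPolicy`, `shouldOfflineNode` and `minHealthyFraction` are hypothetical names, not existing NM configuration or API. Since the RM keeps allocating the same number of containers regardless of how many disks remain, the fraction would presumably need to be chosen with the per-disk load in mind.

```java
// Hypothetical helper: decides whether a node has too few healthy local dirs.
class DiskHealthPolicy {
    // Returns true when the healthy fraction of dirs is below minHealthyFraction.
    // A node with no configured dirs is treated as unusable.
    static boolean shouldOfflineNode(int totalDirs, int failedDirs, double minHealthyFraction) {
        if (totalDirs <= 0) {
            return true;
        }
        double healthyFraction = (double) (totalDirs - failedDirs) / totalDirs;
        return healthyFraction < minHealthyFraction;
    }
}
```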

What's the plan for testing? Adding some basic fault injection (either manually as in the NN or via a framework) to the commonly used paths would make testing easier than what we do in MR1. 
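
One possible shape for manual fault injection on a write path is sketched below, assuming the code under test accepts an `OutputStream`. `FailingOutputStream` is a hypothetical test helper, not an existing Hadoop class. It passes bytes through until a configured budget is used up and then throws, which simulates a disk that fails partway through a log or intermediate-data write.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical test helper: fails every write after a fixed number of bytes.
class FailingOutputStream extends FilterOutputStream {
    private long remaining;

    FailingOutputStream(OutputStream out, long bytesBeforeFailure) {
        super(out);
        this.remaining = bytesBeforeFailure;
    }

    // FilterOutputStream routes the array-based write methods through
    // write(int), so overriding this one method covers all of them.
    @Override
    public void write(int b) throws IOException {
        if (remaining <= 0) {
            throw new IOException("injected disk failure");
        }
        remaining--;
        out.write(b);
    }
}
```

A test could wrap the stream handed to the logging or spill code with this class and assert that the container fails fast and the directory ends up marked as failed.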
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.0
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact of transient/permanent disk failures on containers. With larger number of disks per node, the ability to continue to run containers on other disks is crucial.
