Posted to mapreduce-issues@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2014/07/22 23:12:39 UTC

[jira] [Resolved] (MAPREDUCE-481) Improvements to Global Black-listing of TaskTrackers

     [ https://issues.apache.org/jira/browse/MAPREDUCE-481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved MAPREDUCE-481.
----------------------------------------

    Resolution: Fixed

Closing this out as stale.

Blacklisting received a lot more work in 1.x, and this is no longer relevant for 2.x.

> Improvements to Global Black-listing of TaskTrackers
> ----------------------------------------------------
>
>                 Key: MAPREDUCE-481
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-481
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>
> HADOOP-4305 added a global black-list of tasktrackers.
> We saw a scenario on one of our clusters where a few jobs caused a lot of tasktrackers to be blacklisted immediately. This was caused by a specific set of jobs (from the same user) whose tasks were shot down by the TaskTracker for being over the vmem limit of 2G. Each of these jobs had over 600 failures of the same kind. This resulted in each of the users black-listing some tasktrackers, which in itself is wrong since the failures had nothing to do with the node on which they occurred (they were due to high memory usage) and shouldn't have penalized the tasktracker. We clearly need to start treating system and user failures separately for black-listing etc. A DiskError is fatal and should probably get the tasktracker blacklisted immediately, while a task which was 'failed' for using more memory shouldn't count against the tasktracker at all!
> The other problem is that we never configured mapred.max.tracker.blacklists and continue to use the default value of 4. Furthermore, this config should really be a percentage of the cluster size and not a whole number.
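
For illustration only, the two proposals in the description (treating system and user failures separately, and deriving the threshold from a percentage of the cluster size) could be sketched roughly as below. This is not Hadoop code: the class BlacklistSketch, the FailureKind enum, and every method name are hypothetical, and the rule that unclassified failures count toward the threshold is an assumption made for the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; none of these names come from Hadoop. */
class BlacklistSketch {

    /** Hypothetical failure categories. */
    enum FailureKind {
        DISK_ERROR,            // system fault: the node itself is suspect
        MEMORY_LIMIT_EXCEEDED, // user fault: the task went over its vmem limit
        OTHER                  // unclassified: assumed to count toward the threshold
    }

    private final Map<String, Integer> faults = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();
    private final int threshold;

    /** Derives the threshold from a percentage of the cluster size, never below 1. */
    BlacklistSketch(int clusterSize, double percentOfCluster) {
        this.threshold =
            Math.max(1, (int) Math.ceil(clusterSize * percentOfCluster / 100.0));
    }

    int threshold() {
        return threshold;
    }

    boolean isBlacklisted(String tracker) {
        return blacklisted.contains(tracker);
    }

    /** Records one failure reported against a tracker and updates the blacklist. */
    void recordFailure(String tracker, FailureKind kind) {
        switch (kind) {
            case MEMORY_LIMIT_EXCEEDED:
                return; // user failure: never held against the tracker
            case DISK_ERROR:
                blacklisted.add(tracker); // fatal system failure: blacklist at once
                return;
            default:
                int count = faults.merge(tracker, 1, Integer::sum);
                if (count >= threshold) {
                    blacklisted.add(tracker);
                }
        }
    }

    public static void main(String[] args) {
        // 2% of a hypothetical 200-node cluster gives a threshold of 4.
        BlacklistSketch sketch = new BlacklistSketch(200, 2.0);
        for (int i = 0; i < 600; i++) {
            sketch.recordFailure("tracker-a", FailureKind.MEMORY_LIMIT_EXCEEDED);
        }
        sketch.recordFailure("tracker-b", FailureKind.DISK_ERROR);
        System.out.println("tracker-a blacklisted: " + sketch.isBlacklisted("tracker-a"));
        System.out.println("tracker-b blacklisted: " + sketch.isBlacklisted("tracker-b"));
    }
}
```

In this sketch, 600 memory-limit failures leave a tracker untouched while a single disk error blacklists it, and the threshold grows with the cluster instead of staying fixed at a whole number.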



--
This message was sent by Atlassian JIRA
(v6.2#6252)