Posted to mapreduce-issues@hadoop.apache.org by "Greg Roelofs (JIRA)" <ji...@apache.org> on 2010/07/24 04:06:50 UTC

[jira] Commented: (MAPREDUCE-1966) Fix tracker blacklisting

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891905#action_12891905 ] 

Greg Roelofs commented on MAPREDUCE-1966:
-----------------------------------------

It is hard to distinguish sick nodes (typically failing hardware: a hard drive, memory, or occasionally a NIC or network switch) from nodes that have been rendered unresponsive by user abuse.  The existing blacklist heuristics touch on this, but they're somewhat ad hoc, and there's little visibility into the internal state at any given time.

One improvement would be to track the per-node, per-job blacklisting history in a sliding window that's divided into buckets of some suitable granularity.  Bad hardware would tend to show up as an elevated fault level on one node (or a few nodes) for an extended period--i.e., multiple buckets--while abusive jobs would tend to show up as a spike (ideally) or at least a limited-duration jump in faults (one or a few buckets) across many nodes.
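
To make the bucketed-window idea concrete, here is a minimal sketch in Java. It is not taken from the JobTracker code: the class name FaultWindow, its methods, and the thresholds are all hypothetical, the per-job dimension is left out for brevity, and timestamps are assumed to be non-negative and to arrive in roughly increasing order.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in Hadoop.
class FaultWindow {
    private final int numBuckets;     // how many buckets the window spans
    private final long bucketMillis;  // granularity of one bucket
    // node name -> (absolute bucket index -> fault count)
    private final Map<String, Map<Long, Integer>> faults = new HashMap<>();

    FaultWindow(int numBuckets, long bucketMillis) {
        this.numBuckets = numBuckets;
        this.bucketMillis = bucketMillis;
    }

    // Record one fault for a node and drop that node's buckets that have
    // slid out of the window ending at this fault's bucket.
    void recordFault(String node, long timeMillis) {
        long bucket = timeMillis / bucketMillis;
        long oldest = bucket - numBuckets + 1;
        Map<Long, Integer> perBucket =
            faults.computeIfAbsent(node, k -> new HashMap<>());
        perBucket.keySet().removeIf(b -> b < oldest);
        perBucket.merge(bucket, 1, Integer::sum);
    }

    // Number of buckets in the window ending at nowMillis in which the
    // node had at least one fault.
    int bucketsWithFaults(String node, long nowMillis) {
        long newest = nowMillis / bucketMillis;
        long oldest = newest - numBuckets + 1;
        Map<Long, Integer> perBucket =
            faults.getOrDefault(node, Collections.<Long, Integer>emptyMap());
        int count = 0;
        for (long b = oldest; b <= newest; b++) {
            if (perBucket.getOrDefault(b, 0) > 0) {
                count++;
            }
        }
        return count;
    }

    // Number of distinct nodes with at least one fault in the bucket
    // that contains timeMillis.
    int nodesWithFaultsInBucket(long timeMillis) {
        long bucket = timeMillis / bucketMillis;
        int count = 0;
        for (Map<Long, Integer> perBucket : faults.values()) {
            if (perBucket.getOrDefault(bucket, 0) > 0) {
                count++;
            }
        }
        return count;
    }

    // Faults spread over many buckets on one node hint at bad hardware.
    boolean looksLikeBadHardware(String node, long nowMillis, int minBuckets) {
        return bucketsWithFaults(node, nowMillis) >= minBuckets;
    }

    // Faults on many nodes within one bucket hint at an abusive job.
    boolean looksLikeAbusiveJob(long timeMillis, int minNodes) {
        return nodesWithFaultsInBucket(timeMillis) >= minNodes;
    }
}
```

A caller would create one window (say, new FaultWindow(24, 3600000L) for 24 one-hour buckets), call recordFault whenever a tracker reports a failure, and consult the two looksLike methods before deciding anything.  The thresholds are exactly the kind of tunable that is open to argument, so the results are better treated as hints than as verdicts.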

Because the heuristics are open to argument even among experts (which would not include me), and because automatic, hardcoded blacklisting has the potential to wipe out a good fraction of a cluster for the wrong reasons, it would seem best to convert the heuristic form of blacklisting to an advisory mode (i.e., "graylisting") until the behavior is better understood.

> Fix tracker blacklisting 
> -------------------------
>
>                 Key: MAPREDUCE-1966
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1966
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>            Reporter: Arun C Murthy
>
> The current heuristic of rolling up a fixed number of job failures per tracker isn't working well; we need a better design and better heuristics.
