Posted to mapreduce-issues@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2010/07/24 00:39:50 UTC

[jira] Created: (MAPREDUCE-1966) Fix tracker blacklisting

Fix tracker blacklisting 
-------------------------

                 Key: MAPREDUCE-1966
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1966
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobtracker
            Reporter: Arun C Murthy


The current heuristic of rolling up a fixed number of job failures per tracker isn't working well; we need a better design and better heuristics.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (MAPREDUCE-1966) Fix tracker blacklisting

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs reassigned MAPREDUCE-1966:
---------------------------------------

    Assignee: Greg Roelofs




[jira] Updated: (MAPREDUCE-1966) Fix tracker blacklisting

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated MAPREDUCE-1966:
------------------------------------

    Attachment: MR-1966.v1.trunk-hadoop-mapreduce.patch

Initial patch; only minimal testing so far.

Still working on TestTaskTrackerBlacklisting.java, which requires some care.




[jira] Commented: (MAPREDUCE-1966) Fix tracker blacklisting

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891905#action_12891905 ] 

Greg Roelofs commented on MAPREDUCE-1966:
-----------------------------------------

There's an ambiguity between sick nodes (typically due to failing hardware: a hard drive, memory, or occasionally a NIC or network switch) and nodes that have been rendered unresponsive by user abuse.  The existing blacklist heuristics touch on this, but they're a bit ad hoc, and there's not much visibility into the internal state at any given time.

One improvement would be to track the per-node, per-job blacklisting history in a sliding window that's divided into buckets of some suitable granularity.  Bad hardware would tend to show up as an elevated fault level on one node (or a few nodes) for an extended period--i.e., multiple buckets--while abusive jobs would tend to show up as a spike (ideally) or at least a limited-duration jump in faults (one or a few buckets) across many nodes.
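
As a rough illustration of the bucketed sliding window described above, here is a minimal Java sketch.  It is not taken from the attached patches or from the Hadoop code base: the class name FaultWindow, its methods, and the bucket count and width are all hypothetical, and it assumes timestamps are non-negative and arrive in non-decreasing order.  One such window would be kept per node.

```java
// Hypothetical sketch: fault counts for one node in a bucketed sliding window.
// None of these names come from Hadoop or from the MAPREDUCE-1966 patches.
final class FaultWindow {
    private final int[] counts;       // faults per bucket, used as a circular buffer
    private final long bucketMillis;  // width of one bucket in milliseconds
    private long newestBucket = -1;   // absolute index of the newest bucket seen so far

    FaultWindow(int numBuckets, long bucketMillis) {
        this.counts = new int[numBuckets];
        this.bucketMillis = bucketMillis;
    }

    // Zero out the buckets that have slid out of the window since the last call.
    private void advanceTo(long nowMillis) {
        long bucket = nowMillis / bucketMillis;
        if (bucket <= newestBucket) {
            return;
        }
        long stale = Math.min(bucket - newestBucket, counts.length);
        for (long i = 0; i < stale; i++) {
            counts[(int) ((bucket - i) % counts.length)] = 0;
        }
        newestBucket = bucket;
    }

    // Record one fault; assumes timestamps are non-decreasing.
    void recordFault(long nowMillis) {
        advanceTo(nowMillis);
        counts[(int) (newestBucket % counts.length)]++;
    }

    // Total faults still inside the window.
    int totalFaults(long nowMillis) {
        advanceTo(nowMillis);
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Number of buckets in the window holding at least `threshold` faults.
    int elevatedBuckets(long nowMillis, int threshold) {
        advanceTo(nowMillis);
        int n = 0;
        for (int c : counts) {
            if (c >= threshold) {
                n++;
            }
        }
        return n;
    }
}
```

Under these assumptions, a node whose elevatedBuckets() count stays high across most of the window would look like the bad-hardware case, while a jump confined to one or two buckets but seen on many nodes at once would look more like an abusive job.  Where to set the thresholds is exactly the part that remains open to argument.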

Because the heuristics are open to argument even among experts (which would not include me), and because automatic, hardcoded blacklisting has the potential to wipe out a good fraction of a cluster for the wrong reasons, it would seem best to convert the heuristic form of blacklisting to an advisory mode (i.e., "graylisting") until the behavior is better understood.




[jira] Updated: (MAPREDUCE-1966) Fix tracker blacklisting

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated MAPREDUCE-1966:
------------------------------------

    Attachment: MR-1966.v2.trunk-hadoop-mapreduce.patch.txt

"Reasonably final" patch, pending review, test-patch, etc.  I'm firing off tests in a few minutes.

