Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2007/06/05 17:28:26 UTC

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501591 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

Some early thoughts...

Bottom line: we don't want the reducer, and hence the job, to get stuck forever.

The main issue is that when a reducer is stuck in the shuffle, it's hard to say accurately whether the fault lies with the map (Jetty acting weird), with the reduce, or with both. Having said that, it's pertinent to keep in mind that _normally_ maps are cheaper to re-execute.

Given the above I'd like to propose something along these lines:

a) The reduce maintains a per-map count of fetch failures.

b) Given sufficient fetch failures for a given map (say 3 or 4), the reducer complains to the JobTracker via a new RPC: 
{code:title=JobTracker.java}
/**
 * Called by a reduce task to report that it has repeatedly failed to
 * fetch the output of the given map task.
 */
public synchronized void notifyFailedFetch(String reduceTaskId, String mapTaskId) {
  // ...
}
{code}

c) The JobTracker maintains a per-map count of failed-fetch notifications, and given a sufficient number of them (say 2 or 3?) from *any* reducer (even multiple times from the same reducer), it fails the map and re-schedules it elsewhere.
  
  This handles two cases: a) faulty maps are re-executed, and b) the corner case where only the last reducer is stuck on a given map, and hence the map has to be re-executed.
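
  Just to make (c) concrete, here's a rough, purely illustrative sketch of the JobTracker-side bookkeeping; the field name, threshold and the failMapTaskAndReschedule() helper are made-up placeholders, not existing code:
{code}
// Sketch only: per-map count of failed-fetch notifications on the JobTracker.
private Map<String, Integer> fetchFailureNotifications = new HashMap<String, Integer>();
private static final int MAX_FAILED_FETCH_NOTIFICATIONS = 3; // the "2 or 3?" above

public synchronized void notifyFailedFetch(String reduceTaskId, String mapTaskId) {
  Integer count = fetchFailureNotifications.get(mapTaskId);
  int newCount = (count == null) ? 1 : count.intValue() + 1;
  fetchFailureNotifications.put(mapTaskId, newCount);

  // Enough complaints from any reducer(s), even the same one repeatedly:
  // declare the map output lost and re-run the map elsewhere.
  if (newCount >= MAX_FAILED_FETCH_NOTIFICATIONS) {
    fetchFailureNotifications.remove(mapTaskId);
    failMapTaskAndReschedule(mapTaskId); // hypothetical helper
  }
}
{code}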

d) To counter the case of faulty reduces, we could implement a scheme where the reducer kills itself once it has notified the JobTracker of more than, say, 5 unique faulty maps. This will ensure that a faulty reducer does not result in the JobTracker spawning maps willy-nilly...
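
And a similarly rough sketch of the reduce-side pieces (a, b and d together); again, the names and thresholds are placeholders for illustration, not actual ReduceTask code. Here jobTracker stands for whatever proxy the task uses to talk to the JobTracker, and reduceTaskId for this task's own id:
{code}
// Sketch only: reduce-side bookkeeping for fetch failures.
private Map<String, Integer> fetchFailuresPerMap = new HashMap<String, Integer>();
private Set<String> mapsComplainedAbout = new HashSet<String>();
private static final int FAILURES_BEFORE_NOTIFYING = 3;   // the "3 or 4" above
private static final int MAX_FAULTY_MAPS_PER_REDUCER = 5; // the "say 5 unique" above

private void noteFailedFetch(String mapTaskId) throws IOException {
  // a) per-map count of fetch failures
  Integer count = fetchFailuresPerMap.get(mapTaskId);
  int newCount = (count == null) ? 1 : count.intValue() + 1;
  fetchFailuresPerMap.put(mapTaskId, newCount);

  // b) enough failures for this map: complain to the JobTracker via the new rpc
  if (newCount >= FAILURES_BEFORE_NOTIFYING) {
    jobTracker.notifyFailedFetch(reduceTaskId, mapTaskId);
    mapsComplainedAbout.add(mapTaskId);
    fetchFailuresPerMap.put(mapTaskId, 0); // start counting afresh for this map
  }

  // d) too many distinct faulty maps reported: assume this reducer itself is
  // faulty and kill it, rather than have the JobTracker spawn maps willy-nilly
  if (mapsComplainedAbout.size() > MAX_FAULTY_MAPS_PER_REDUCER) {
    throw new IOException("Too many fetch failures; failing this reduce task");
  }
}
{code}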

Thoughts?

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable - it will take care of (faulty) TaskTrackers that persistently fail to serve up map outputs without properly raising/handling exceptions (e.g., if the exception/problem happens in the Jetty server).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.