Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/03/25 20:47:32 UTC

[jira] Created: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-1158
                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.12.2
            Reporter: Devaraj Das


The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this count exceeds a certain threshold, that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes fail persistently to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).
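
A minimal Java sketch of the per-map bookkeeping described above. The class name MapFetchFailureTracker, its methods, and the threshold value are hypothetical and are not taken from Hadoop or from any patch on this issue; the sketch only illustrates counting reducer complaints per map and flagging a map for re-execution once the count exceeds a threshold.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: counts reducer complaints per map task.
class MapFetchFailureTracker {
    private final int maxFailuresPerMap; // assumed threshold, chosen by the caller
    private final Map<String, Integer> failures = new HashMap<>();

    MapFetchFailureTracker(int maxFailuresPerMap) {
        this.maxFailuresPerMap = maxFailuresPerMap;
    }

    /**
     * Records one failed fetch of the given map's output.
     * Returns true when the count exceeds the threshold, meaning the map
     * should be declared lost and re-executed; its count is then reset.
     */
    boolean reportFetchFailure(String mapTaskId) {
        int count = failures.merge(mapTaskId, 1, Integer::sum);
        if (count > maxFailuresPerMap) {
            failures.remove(mapTaskId);
            return true;
        }
        return false;
    }

    /** Current number of recorded complaints for a map (0 if none). */
    int failureCount(String mapTaskId) {
        return failures.getOrDefault(mapTaskId, 0);
    }

    public static void main(String[] args) {
        MapFetchFailureTracker tracker = new MapFetchFailureTracker(3);
        for (int i = 1; i <= 4; i++) {
            boolean lost = tracker.reportFetchFailure("map_0001");
            System.out.println("complaint " + i + " -> re-execute: " + lost);
        }
    }
}
```

With a threshold of 3, the fourth complaint about the same map is the one that returns true.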


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501623 ] 

Doug Cutting commented on HADOOP-1158:
--------------------------------------

This sounds like a good design to me.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this count exceeds a certain threshold, that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes fail persistently to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521956 ] 

Hadoop QA commented on HADOOP-1158:
-----------------------------------

-1, new javadoc warnings

The javadoc tool appears to have generated warning messages when testing the latest attachment http://issues.apache.org/jira/secure/attachment/12363993/HADOOP-1158_4_20070817.patch against trunk revision r568706.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/596/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/596/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this count exceeds a certain threshold, that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes fail persistently to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502216 ] 

Doug Cutting commented on HADOOP-1158:
--------------------------------------

Putting all status into a single object sounds like a good approach.  Perhaps the method should be renamed 'status' rather than 'progress'?


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518539 ] 

Hadoop QA commented on HADOOP-1158:
-----------------------------------

+0, new Findbugs warnings

http://issues.apache.org/jira/secure/attachment/12363431/HADOOP-1158_2_20070808.patch
applied and successfully tested against trunk revision r563649,
but there appear to be new Findbugs warnings introduced by this patch.

New Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/530/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/530/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/530/console

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch
>
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this count exceeds a certain threshold, that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes fail persistently to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522165 ] 

Enis Soztutar commented on HADOOP-1158:
---------------------------------------

Here is the code in Jetty that prints the above warning.
{code}
 log.info("LOW ON THREADS (("+getMaxThreads()+"-"+getThreads()+"+"+getIdleThreads()+")<"+getMinThreads()+") on "+ this); 
{code}

It seems Jetty is configured with max threads = 40; isn't that insufficient?
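
Reading the log message literally, the warning seems to fire when (maxThreads - threads + idleThreads) drops below minThreads. The sketch below only evaluates that expression; it is not Jetty's source, and the class name and the sample values other than max threads = 40 are assumptions chosen for illustration.

```java
// Illustrative only: mirrors the expression printed in the log message,
// not Jetty's actual implementation.
class LowOnThreadsCheck {
    /** Unused pool capacity plus threads that are currently idle. */
    static int availableThreads(int maxThreads, int threads, int idleThreads) {
        return maxThreads - threads + idleThreads;
    }

    /** True when the available threads fall below the configured minimum. */
    static boolean isLowOnThreads(int maxThreads, int threads,
                                  int idleThreads, int minThreads) {
        return availableThreads(maxThreads, threads, idleThreads) < minThreads;
    }

    public static void main(String[] args) {
        // max = 40 comes from the comment above; the other values are made up.
        System.out.println(isLowOnThreads(40, 40, 1, 2)); // pool exhausted, one idle thread
        System.out.println(isLowOnThreads(40, 10, 5, 2)); // plenty of headroom
    }
}
```

Under this reading, a pool that has already created all 40 threads reports the warning as soon as fewer than minThreads of them are idle, which is why a larger maximum may be worth trying when many reducers fetch concurrently.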

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this count exceeds a certain threshold, that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes fail persistently to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502212 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

Oh, it might be worth considering separate MapTaskStatus and ReduceTaskStatus classes, since Map and Reduce tasks carry various unrelated pieces of information (e.g. shuffle/sort-merge related info, fetch failures etc.) ... we could put it in the appropriate 'Task' class too (which the child-vm could then compute and send to the TaskTracker) - perhaps as a separate issue?
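
One possible shape for such a split, as a rough Java sketch. All class and field names here (TaskStatusSketch, MapTaskStatusSketch, ReduceTaskStatusSketch, the timing fields) are assumptions made for illustration and do not describe what any patch on this issue contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class layout; names and fields are assumptions for illustration.
abstract class TaskStatusSketch {
    private final String taskId;
    private float progress;

    TaskStatusSketch(String taskId) { this.taskId = taskId; }

    String getTaskId() { return taskId; }
    float getProgress() { return progress; }
    void setProgress(float progress) { this.progress = progress; }

    abstract boolean isMap();
}

/** Map tasks carry only the common fields in this sketch. */
class MapTaskStatusSketch extends TaskStatusSketch {
    MapTaskStatusSketch(String taskId) { super(taskId); }
    @Override boolean isMap() { return true; }
}

/** Reduce tasks add shuffle/sort timing and fetch-failure information. */
class ReduceTaskStatusSketch extends TaskStatusSketch {
    private long shuffleFinishTime;
    private long sortFinishTime;
    private final List<String> failedFetchMapIds = new ArrayList<>();

    ReduceTaskStatusSketch(String taskId) { super(taskId); }
    @Override boolean isMap() { return false; }

    void setShuffleFinishTime(long t) { shuffleFinishTime = t; }
    long getShuffleFinishTime() { return shuffleFinishTime; }
    void setSortFinishTime(long t) { sortFinishTime = t; }
    long getSortFinishTime() { return sortFinishTime; }

    void addFailedFetch(String mapTaskId) { failedFetchMapIds.add(mapTaskId); }
    List<String> getFailedFetchMapIds() { return new ArrayList<>(failedFetchMapIds); }
}
```

The point of the split is that reduce-only details stay out of the status objects reported for map tasks.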


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511034 ] 

Devaraj Das commented on HADOOP-1158:
-------------------------------------

One comment: in cases where a reduce fails to fetch a map output a number of times, it informs the JobTracker, and the JobTracker reexecutes the map. The reduce task also increments a counter tracking the number of failed fetches (which the patch uses to kill the reduce when the counter exceeds a certain number). Now, if the reduce can fetch the map output correctly from the new map location, it should decrement that counter.
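
The suggested behaviour could be sketched in Java roughly as follows. ReduceFetchFailureCounter and its method names are hypothetical, as is the limit; this is only an illustration of incrementing on a reported failure and decrementing once the same map's output is fetched successfully, not the code in the attached patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical reducer-side bookkeeping; names and limit are assumptions.
class ReduceFetchFailureCounter {
    private final int maxFailedFetches;
    // Maps this reduce has already reported to the JobTracker as unfetchable.
    private final Set<String> reportedMaps = new HashSet<>();
    private int failedFetches = 0;

    ReduceFetchFailureCounter(int maxFailedFetches) {
        this.maxFailedFetches = maxFailedFetches;
    }

    /**
     * Called when a map is reported to the JobTracker after repeated fetch
     * failures. Returns true if the reduce has now exceeded its limit and
     * should kill itself.
     */
    boolean mapReportedAsFailed(String mapId) {
        if (reportedMaps.add(mapId)) {
            failedFetches++;
        }
        return failedFetches > maxFailedFetches;
    }

    /** Called when a map's output is fetched successfully, e.g. from a re-executed map. */
    void fetchSucceeded(String mapId) {
        if (reportedMaps.remove(mapId)) {
            failedFetches--; // undo the earlier increment for this map only
        }
    }

    int failedFetches() {
        return failedFetches;
    }
}
```

Tracking the reported maps in a set keeps the decrement from being applied to maps that never failed, or twice to the same map.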

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>         Attachments: HADOOP-1158_20070702_1.patch
>
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch for a particular map output has failed. If this count exceeds a certain threshold, that map should be declared lost and reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more reliable: it will take care of (faulty) TaskTrackers that sometimes fail persistently to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522193 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

Thanks for the review Enis.

So, here is how we solve issues emanating from Jetty: if there are sufficient failures for a given map (say due to Jetty), we just fail the map and re-run it elsewhere, so the reducer isn't stuck. Now, if a sufficient number of maps fail on the same TaskTracker (say Jetty again), it gets blacklisted and hence no tasks are assigned to it... does that make sense?
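
The second half of that policy (blacklisting a TaskTracker once enough of its maps have been failed) might look roughly like the Java sketch below. TaskTrackerBlacklistSketch, its methods, and the limit are assumptions for illustration and are not the names used in the patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical JobTracker-side bookkeeping; names and limit are assumptions.
class TaskTrackerBlacklistSketch {
    private final int maxFailedMapsPerTracker;
    private final Map<String, Integer> failedMapsByTracker = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    TaskTrackerBlacklistSketch(int maxFailedMapsPerTracker) {
        this.maxFailedMapsPerTracker = maxFailedMapsPerTracker;
    }

    /** Records that a map hosted on this tracker was failed due to fetch failures. */
    void recordFailedMap(String trackerName) {
        int failed = failedMapsByTracker.merge(trackerName, 1, Integer::sum);
        if (failed >= maxFailedMapsPerTracker) {
            blacklisted.add(trackerName);
        }
    }

    /** Blacklisted trackers receive no further tasks. */
    boolean isAssignable(String trackerName) {
        return !blacklisted.contains(trackerName);
    }

    public static void main(String[] args) {
        TaskTrackerBlacklistSketch jt = new TaskTrackerBlacklistSketch(2);
        jt.recordFailedMap("tracker_a");
        jt.recordFailedMap("tracker_a");
        System.out.println("tracker_a assignable: " + jt.isAssignable("tracker_a"));
        System.out.println("tracker_b assignable: " + jt.isAssignable("tracker_b"));
    }
}
```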

Please feel free to open further issues if you have other thoughts to help improve things...


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522248 ] 

Enis Soztutar commented on HADOOP-1158:
---------------------------------------

Yes, the TT will be blacklisted, and in some cases it is unlikely that the TT will resume its normal computation. But restarting the TT automatically has a chance of recovering its state. I think it would improve the self-managing aspects of the cluster.




[jira] Issue Comment Edited: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501870 ] 

Arun C Murthy edited comment on HADOOP-1158 at 6/6/07 8:27 PM:
---------------------------------------------------------------

bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then complains to the JobTracker via a new rpc:

I take that back. I propose we augment TaskStatus itself to let the JobTracker know about the failed fetches, i.e. map taskids.

We could just add a new RPC to TaskUmbilicalProtocol for the reduce-task to let the TaskTracker know about the failed fetch...
{code:title=TaskUmbilical.java}
void fetchError(String taskId, String failedFetchMapTaskId);
{code}

Even better, though a tad more involved, is to rework
{code:title=TaskUmbilical.java}
  void progress(String taskid, float progress, String state, 
                            TaskStatus.Phase phase, Counters counters)
   throws IOException, InterruptedException;
{code}
as
{code:title=TaskUmbilical.java}
  void progress(String taskid, TaskStatus taskStatus)
   throws IOException, InterruptedException;
{code}

This simplifies the flow so that the child-vm itself computes its {{TaskStatus}} (which will be augmented to contain the failed-fetch-mapIds) and sends it along to the {{TaskTracker}}, which just forwards it to the {{JobTracker}}, thereby relieving the {{TaskTracker}} of some of its responsibilities vis-a-vis computing the {{TaskStatus}}. Clearly this could be linked to the reporting re-design at HADOOP-1462 ...
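
A self-contained Java sketch of that flow, under stated assumptions: ChildTaskStatus, UmbilicalSketch and ForwardingTaskTracker are stand-in names invented here so the example compiles on its own, and they do not reproduce Hadoop's real TaskStatus or TaskUmbilicalProtocol. The child builds its own status, including the failed-fetch map IDs, and the TaskTracker merely queues it for forwarding.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for an augmented TaskStatus; all names here are hypothetical.
class ChildTaskStatus {
    private final String taskId;
    private final float progress;
    private final List<String> failedFetchMapIds = new ArrayList<>();

    ChildTaskStatus(String taskId, float progress) {
        this.taskId = taskId;
        this.progress = progress;
    }

    void addFailedFetch(String mapTaskId) { failedFetchMapIds.add(mapTaskId); }

    String getTaskId() { return taskId; }
    float getProgress() { return progress; }
    List<String> getFailedFetchMapIds() { return new ArrayList<>(failedFetchMapIds); }
}

// Stand-in for the reworked umbilical call: one status object per report.
interface UmbilicalSketch {
    void progress(String taskId, ChildTaskStatus status)
        throws IOException, InterruptedException;
}

// The TaskTracker side does no computation; it queues statuses for the JobTracker.
class ForwardingTaskTracker implements UmbilicalSketch {
    private final List<ChildTaskStatus> pendingForJobTracker = new ArrayList<>();

    @Override
    public void progress(String taskId, ChildTaskStatus status) {
        pendingForJobTracker.add(status);
    }

    /** Returns and clears the statuses to send with the next report to the JobTracker. */
    List<ChildTaskStatus> drainPending() {
        List<ChildTaskStatus> out = new ArrayList<>(pendingForJobTracker);
        pendingForJobTracker.clear();
        return out;
    }
}
```

In this arrangement the JobTracker would read the failed-fetch map IDs straight out of the forwarded status, so no separate fetchError RPC is needed.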

Thoughts?


 was:
bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then complains to the JobTracker via a new rpc:

I take that back, I propose we augment TaskStatus itself to let the JobTracker know about the failed fetches, i.e. map taskids; we could just add a new RPC to TaskUmbilicalProtocol for the reduce-task to let the TaskTracker know about the failed fetch.


[jira] Assigned: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned HADOOP-1158:
-------------------------------------

    Assignee: Arun C Murthy


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Status: Patch Available  (was: Open)


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Attachment: HADOOP-1158_4_20070817.patch

Fixed the warning and updated to reflect changes to trunk...

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Attachment: HADOOP-1158_20070702_1.patch

Early patch while I continue testing it further...

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>         Attachments: HADOOP-1158_20070702_1.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Attachment: HADOOP-1158_2_20070808.patch

Finally a new patch incorporating Devaraj's comments and well-tested too... 

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520893 ] 

Hadoop QA commented on HADOOP-1158:
-----------------------------------

-1, new javadoc warnings

The javadoc tool appears to have generated warning messages when testing the latest attachment http://issues.apache.org/jira/secure/attachment/12363993/HADOOP-1158_4_20070817.patch against trunk revision r567308.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/573/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/573/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Status: Patch Available  (was: Open)

Re-submitting to check if the warnings were indeed fixed by HADOOP-1726 .

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521883 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

Looks like the javadoc warnings were caused by removal of the ant jars and fixed by HADOOP-1726. 

I'll check if this patch needs to be updated to reflect changes to trunk and submit a new one...

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501743 ] 

Owen O'Malley commented on HADOOP-1158:
---------------------------------------

It looks good to me too.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Status: Patch Available  (was: Open)

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484281 ] 

Owen O'Malley commented on HADOOP-1158:
---------------------------------------

I'd propose a minor variant where the reduce/fetcher tries a few (2? 3?) times before complaining to the JobTracker to cut down on the noise.

I think that restarting Jetty wouldn't be that useful and would potentially cause more trouble.
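
A minimal sketch of the retry-before-complaining variant proposed above, written in present-day Java rather than against the Hadoop code of the time. The class name FetchRetryPolicy, its methods, and the string map IDs are hypothetical and are not taken from any attached patch; the retry count is whatever the caller passes in (the comment suggests 2 or 3).

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical reducer-side helper (not from the patch): counts failed
 * fetches per map output locally and only signals a report to the
 * JobTracker after a few failures in a row.
 */
class FetchRetryPolicy {
    // Number of local failures tolerated before complaining (assumed, e.g. 3).
    private final int failuresBeforeReport;
    private final Map<String, Integer> failures = new HashMap<>();

    FetchRetryPolicy(int failuresBeforeReport) {
        this.failuresBeforeReport = failuresBeforeReport;
    }

    /** Records one failed fetch; returns true when it is time to report. */
    boolean recordFailure(String mapId) {
        int count = failures.merge(mapId, 1, Integer::sum);
        if (count >= failuresBeforeReport) {
            failures.remove(mapId); // start counting afresh after a report
            return true;
        }
        return false;
    }

    /** A successful fetch clears any failures recorded for that map. */
    void recordSuccess(String mapId) {
        failures.remove(mapId);
    }
}
```

With a limit of 3, the first two failures for a map stay local and only the third would be passed on, so the JobTracker sees roughly one third of the raw failure traffic.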

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Attachment: HADOOP-1158_5_20070823.patch

Fix javadoc warnings...

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522150 ] 

Enis Soztutar commented on HADOOP-1158:
---------------------------------------

The patch looks good, but I would like to mention another major issue here. 
There are some cases where the TaskTracker sends heartbeats, but the Jetty server cannot serve the map outputs. Recently we have seen Jetty servers failing to allocate new threads from the thread pool on some of the TaskTrackers, emitting logs:
{noformat}
  2007-08-23 09:31:46,378 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
  2007-08-23 09:31:46,379 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
{noformat}

Moreover, HADOOP-1179 mentions OOM exceptions related to Jetty. We will try to find and eliminate the sources of Jetty-related leaks and bugs, but it is not likely that all of them will be resolved. There will be cases, such as the one above, where RPC responds but HTTP does not, so taking a "computer engineering approach" and solving the problem by restarting seems appropriate. 

Long story short, I think it would be great to do some bookkeeping in the JT about failed fetches per TT and send a reinit action to the TT above some threshold. 
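
As a rough illustration of the per-TaskTracker bookkeeping suggested above, the sketch below keeps a failure count per tracker and signals when the count goes above a threshold. The class TrackerFetchFailures, its method names, and the string tracker names are assumptions made for this example; how the JobTracker would actually deliver a reinit action is not shown.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical JobTracker-side bookkeeping (illustrative only):
 * counts reported fetch failures per TaskTracker.
 */
class TrackerFetchFailures {
    // Assumed configurable limit; the comment above leaves the value open.
    private final int threshold;
    private final Map<String, Integer> failuresByTracker = new HashMap<>();

    TrackerFetchFailures(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records one failed fetch against the tracker that served the map output.
     * Returns true when the count goes above the threshold, meaning the
     * caller would send a reinit action to that tracker.
     */
    boolean reportFailure(String trackerName) {
        int count = failuresByTracker.merge(trackerName, 1, Integer::sum);
        if (count > threshold) {
            failuresByTracker.remove(trackerName); // fresh start after reinit
            return true;
        }
        return false;
    }

    /** Current failure count for a tracker (0 if none recorded). */
    int failuresFor(String trackerName) {
        return failuresByTracker.getOrDefault(trackerName, 0);
    }
}
```

Clearing the count after a reinit is a design choice of this sketch; a real implementation might instead age the counts out over time.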

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Status: Open  (was: Patch Available)

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Status: Patch Available  (was: Open)

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501791 ] 

Devaraj Das commented on HADOOP-1158:
-------------------------------------

Given the fact that in general losing reduces is detrimental, I'd propose a minor variant to the logic behind killing reduces. The reduce should kill itself when it fails to fetch the map output from even the new location, i.e., the unique 5 faulty fetches should have at least 1 retrial (i.e., we don't kill a reduce too early).

Also, does it make sense to have the logic behind killing/reexecuting reduces in the JobTracker? Two reasons:
1) since the JobTracker knows very well how many times a reduce complained, and, for which maps it complained, etc., 
2) consistent behavior - jobtracker handles the reexecution of maps and it might handle the reexecution of reduces as well.
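
One possible reading of the kill rule proposed above, sketched in present-day Java: the reduce is killed only once the required number of distinct maps (5 in the comment) have each failed again on at least one retry. That interpretation, the class name ReduceKillPolicy, and its methods are assumptions made for this example and are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical policy object (illustrative only): decides when a reduce
 * has seen enough retried fetch failures to give up.
 */
class ReduceKillPolicy {
    // Number of distinct failing maps required (5 in the comment above).
    private final int uniqueFailureLimit;
    private final Map<String, Integer> failedAttemptsByMap = new HashMap<>();

    ReduceKillPolicy(int uniqueFailureLimit) {
        this.uniqueFailureLimit = uniqueFailureLimit;
    }

    /** Records one failed fetch attempt for the given map output. */
    void recordFailure(String mapId) {
        failedAttemptsByMap.merge(mapId, 1, Integer::sum);
    }

    /**
     * True only when at least uniqueFailureLimit distinct maps have each
     * failed two or more times, i.e. the first fetch and at least one retry.
     */
    boolean shouldKill() {
        int retriedAndStillFailing = 0;
        for (int attempts : failedAttemptsByMap.values()) {
            if (attempts >= 2) {
                retriedAndStillFailing++;
            }
        }
        return retriedAndStillFailing >= uniqueFailureLimit;
    }
}
```

The same object could live in either the reduce task or the JobTracker, which is the placement question raised in the two points above.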

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1158:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Arun!

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).

[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502222 ] 

Devaraj Das commented on HADOOP-1158:
-------------------------------------

Arun, I think this issue is a slightly longer-term solution to a problem, and we have enough time to work towards it. I'd still argue that the best place for the logic behind killing the reduces is the JobTracker. Exceptions like the disk exception and the ping exception are very local cases where a task decides to kill itself, but this issue involves a certain element of globalness (such as the dependency on maps), and the JobTracker is the only one with a global picture of jobs. I don't see how we would lose simplicity by having the logic in the JobTracker. I understand that it will have to maintain a few more bytes per task, but that's not unreasonable.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501859 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

bq. The reduce should kill itself when it fails to fetch the map output from even the new location, i.e., the unique 5 faulty fetches should have at least 1 retrial (i.e., we don't kill a reduce too early).

Though it makes sense in the long-term I'd vote we keep it simple for now... to implement this would entail more complex code and more state to be maintained. 5 notifications anyway mean that the reducer has seen 20 attempts to fetch on 5 different maps fail. I'd say, for now, it's a sufficient reason to kill the reducer.

bq.Also, does it make sense to have the logic behind killing/reexecuting reduces in the JobTracker. Two reasons:
bq.1) since the JobTracker knows very well how many times a reduce complained, and, for which maps it complained, etc.,

If the reducer kills itself, the JobTracker need not maintain information of *which* reduces failed to fetch *which* maps, it could just do with a per-taskid count of failed fetches (for the maps, as notified by reducers) - again leads to simpler code for a first-shot. 
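
A minimal sketch of the per-taskid count described above, assuming the JobTracker only needs a counter keyed by map task id. The class name {{FailedFetchCounter}}, its method and the threshold parameter are hypothetical illustrations, not code from any attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not actual JobTracker code: counts failed-fetch
// notifications per map task id, regardless of which reducer sent them.
class FailedFetchCounter {
  private final int maxNotifications;  // assumed threshold, chosen by the caller
  private final Map<String, Integer> notifications = new HashMap<String, Integer>();

  FailedFetchCounter(int maxNotifications) {
    this.maxNotifications = maxNotifications;
  }

  // Records one notification for the given map. Returns true once the map has
  // collected enough notifications to be failed and rescheduled.
  synchronized boolean notifyFailedFetch(String mapTaskId) {
    Integer previous = notifications.get(mapTaskId);
    int count = (previous == null) ? 1 : previous + 1;
    if (count >= maxNotifications) {
      notifications.remove(mapTaskId);  // a re-executed map starts from zero
      return true;
    }
    notifications.put(mapTaskId, count);
    return false;
  }
}
```

Note that no reducer ids are stored at all, which is the saving in state argued for here.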

bq.2) consistent behavior - jobtracker handles the reexecution of maps and it might handle the reexecution of reduces as well.

I agree with the general sentiment, but given that this leads to more complex code and the reducer already knows it has failed to fetch from 5 different maps it doesn't make sense for it to wait for the JobTracker to fail the task. Also, there is an existing precedent for this behaviour in TaskTracker.fsError (task is marked as 'failed' by the TaskTracker itself on an FSError).

Thoughts?

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501591 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

Some early thoughts...

Bottomline: we don't want the reducer and hence the job to get stuck forever. 

The main issue is that when a reducer is stuck in shuffle it's hard to accurately say whether the fault lies at the map (jetty acting weird) or at the reduce or both. Having said that it's pertinent to keep in mind that _normally_ maps are cheaper to re-execute.

Given the above I'd like to propose something along these lines:

a) The reduce maintains a per-map count of fetch failures.

b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then complains to the JobTracker via a new rpc: 
{code:title=JobTracker.java}
public synchronized void notifyFailedFetch(String reduceTaskId, String mapTaskId) {
  // ...
}
{code}

c) The JobTracker maintains a per-map count of failed-fetch notifications, and given a sufficient number of them (say 2/3?) from *any* reducer (even multiple times from the same reducer) fails the map and re-schedules it elsewhere.
  
  This handles 2 cases: a) Faulty maps are re-executed and b) Corner case where only the last reducer is stuck on a given map and hence the map will have to be re-executed.

d) To counter the case of faulty reduces, we could implement a scheme where the reducer kills itself when it notifies the JobTracker of more than, say 5 unique, faulty fetches. This will ensure that a faulty reducer will not result in the JobTracker spawning maps willy-nilly...
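
The reducer-side part of this proposal (steps a, b and d) could be sketched roughly as follows. This is only an illustration under assumptions: {{ReducerFetchTracker}}, its {{Action}} enum and both thresholds are hypothetical names, and the actual RPC call to the JobTracker is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical reducer-side bookkeeping for steps (a), (b) and (d).
class ReducerFetchTracker {
  enum Action { RETRY, NOTIFY_JOBTRACKER, KILL_SELF }

  private final int failuresPerNotification;  // e.g. 3 or 4, as in step (b)
  private final int maxUniqueFaultyMaps;      // e.g. 5, as in step (d)
  private final Map<String, Integer> failures = new HashMap<String, Integer>();
  private final Set<String> reportedMaps = new HashSet<String>();

  ReducerFetchTracker(int failuresPerNotification, int maxUniqueFaultyMaps) {
    this.failuresPerNotification = failuresPerNotification;
    this.maxUniqueFaultyMaps = maxUniqueFaultyMaps;
  }

  // Called after every failed fetch; tells the caller what to do next.
  Action fetchFailed(String mapTaskId) {
    // (a) per-map count of fetch failures
    Integer previous = failures.get(mapTaskId);
    int count = (previous == null) ? 1 : previous + 1;
    if (count < failuresPerNotification) {
      failures.put(mapTaskId, count);
      return Action.RETRY;
    }
    // (b) enough failures for this map: report it and start counting afresh
    failures.remove(mapTaskId);
    reportedMaps.add(mapTaskId);
    // (d) more than the allowed number of unique faulty maps: give up
    return reportedMaps.size() > maxUniqueFaultyMaps
        ? Action.KILL_SELF
        : Action.NOTIFY_JOBTRACKER;
  }
}
```

The caller would perform the notification on {{NOTIFY_JOBTRACKER}}; whether the bound in (d) should be inclusive is a tuning choice this sketch does not settle.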

Thoughts?

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1158:
---------------------------------

    Status: Open  (was: Patch Available)

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Attachment: HADOOP-1158_3_20070809.patch

Patch which fixes findbugs' warnings...

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501870 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then complains to the JobTracker via a new rpc:

I take that back; I propose we augment TaskStatus itself to let the JobTracker know about the failed fetches, i.e. the map taskids. We could just add a new RPC to TaskUmbilicalProtocol for the reduce task to let the TaskTracker know about the failed fetch.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1158:
---------------------------------

    Status: Open  (was: Patch Available)

This generates a new compiler warning for me:

{noformat}
    [javac] /home/cutting/src/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTaskStatus.java:48: warning: [unchecked] unchecked cast
    [javac] found   : java.lang.Object
    [javac] required: java.util.List<java.lang.String>
    [javac]       (List<String>)(((ArrayList<String>)failedFetchTasks).clone());
    [javac]                     ^
{noformat}
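
One common way to avoid this kind of warning is to copy the list with the {{ArrayList}} copy constructor instead of casting the {{Object}} returned by {{clone()}}. The sketch below is an illustration only: {{FailedFetchList}} and {{snapshot()}} are hypothetical names, and only the field name {{failedFetchTasks}} is taken from the warning above. It is not necessarily how the later patches resolved it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the class holding the field named in the warning.
class FailedFetchList {
  private final List<String> failedFetchTasks = new ArrayList<String>();

  void add(String mapTaskId) {
    failedFetchTasks.add(mapTaskId);
  }

  // The copy constructor yields an independent List<String> directly, so no
  // cast from Object (and hence no unchecked warning) is involved.
  List<String> snapshot() {
    return new ArrayList<String>(failedFetchTasks);
  }
}
```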

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Fix Version/s: 0.15.0
           Status: Patch Available  (was: Open)

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518643 ] 

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

TestDFSUpgrade failed with:

{noformat}
junit.framework.AssertionFailedError: expected:<1790222743> but was:<3731403302>
	at org.apache.hadoop.dfs.TestDFSUpgrade.checkResult(TestDFSUpgrade.java:74)
	at org.apache.hadoop.dfs.TestDFSUpgrade.testUpgrade(TestDFSUpgrade.java:142)
{noformat}

as noted by Nicholas here: http://issues.apache.org/jira/browse/HADOOP-1696#action_12518584

I'm not sure why this failed... passes on my local-box.


> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Updated: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1158:
----------------------------------

    Status: Open  (was: Patch Available)

Need to fix the findbugs warnings...

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518624 ] 

Hadoop QA commented on HADOOP-1158:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12363458/HADOOP-1158_3_20070809.patch against trunk revision r564012.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/535/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/535/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522027 ] 

Hadoop QA commented on HADOOP-1158:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12364381/HADOOP-1158_5_20070823.patch applied and successfully tested against trunk revision r568809.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/602/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/602/console

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1158_20070702_1.patch, HADOOP-1158_2_20070808.patch, HADOOP-1158_3_20070809.patch, HADOOP-1158_4_20070817.patch, HADOOP-1158_5_20070823.patch
>
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).


[jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485481 ] 

Devaraj Das commented on HADOOP-1158:
-------------------------------------

Yes, the reduce/fetcher should try a few times (it could be configurable?) before complaining to the JobTracker. The JobTracker can take a decision on whether to reexecute a Map based on the % of complaints (>50% ?) from fetching reduces. For example, if there are 10 reduces currently fetching, and if at least 5 of them complained about a fetch failing for a particular Map, then the JobTracker should reexecute that Map. Makes sense?
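
As a rough sketch of that percentage rule, assuming the JobTracker knows how many reduces are currently fetching and how many of them complained about one map: {{FetchComplaintPolicy}} and its parameters are hypothetical names, and the inclusive comparison follows the "at least 5 of 10" example rather than a strict ">50%".

```java
// Hypothetical policy helper, not actual JobTracker code.
class FetchComplaintPolicy {
  // Returns true when the share of fetching reduces that complained about a
  // map reaches thresholdPercent (50 matches the 5-of-10 example).
  static boolean shouldReexecuteMap(int complainingReduces,
                                    int fetchingReduces,
                                    int thresholdPercent) {
    if (fetchingReduces <= 0) {
      return false;  // nobody is fetching, so there is nothing to decide
    }
    // Integer arithmetic avoids floating-point rounding at the boundary.
    return complainingReduces * 100 >= thresholdPercent * fetchingReduces;
  }
}
```

With a threshold of 50, 5 complaints out of 10 fetching reduces trigger re-execution, while 4 out of 10 do not.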

> JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>
> The JobTracker should keep a track (with feedback from Reducers) of how many times a fetch for a particular map output failed. If this exceeds a certain threshold, then that map should be declared as lost, and should be reexecuted elsewhere. Based on the number of such complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework reliable - it will take care of (faulty) TaskTrackers that sometimes always fail to serve up map outputs (for which exceptions are not properly raised/handled, for e.g., if the exception/problem happens in the Jetty server).
