Posted to common-dev@hadoop.apache.org by "Sanjay Dahiya (JIRA)" <ji...@apache.org> on 2006/09/19 12:57:22 UTC
[jira] Created: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
ReduceTaskRunner can miss sending heartbeats if no map output copy finishes within "mapred.task.timeout"
--------------------------------------------------------------------------------------------------------
Key: HADOOP-547
URL: http://issues.apache.org/jira/browse/HADOOP-547
Project: Hadoop
Issue Type: Bug
Components: mapred
Affects Versions: 0.6.2
Reporter: Sanjay Dahiya
In ReduceTaskRunner, the main loop that sends heartbeats waits on copyResults, which is notified only when a copy thread finishes copying. As a result, a healthy reduce task that is still copying data can fail if no map output is copied within "mapred.task.timeout".
ReduceTaskRunner.java:490
try {
  copyResults.wait(); // <== unconditional wait
} catch (InterruptedException e) { }
wait() should be called with a timeout, possibly taskTimeout/2, after which the loop should send a heartbeat and go back to waiting.
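The timed wait proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual ReduceTaskRunner code: the class HeartbeatWaitSketch and the methods copyDone(), sendHeartbeat() and waitForCopy() are hypothetical names, and the heartbeat is reduced to a counter.

```java
// Sketch only: all names here are hypothetical, not Hadoop's real code.
class HeartbeatWaitSketch {
    private final Object copyResults = new Object(); // monitor the copy threads notify
    private boolean copyFinished = false;            // guarded by copyResults
    private int heartbeatsSent = 0;                  // guarded by copyResults

    /** Called by a copy thread when one map output has been fetched. */
    void copyDone() {
        synchronized (copyResults) {
            copyFinished = true;
            copyResults.notifyAll();
        }
    }

    /** Stand-in for reporting progress to the tracker; caller holds copyResults. */
    private void sendHeartbeat() {
        heartbeatsSent++;
    }

    /**
     * Waits until a copy finishes, waking up every taskTimeoutMs / 2 to send
     * a heartbeat. Returns the total number of heartbeats sent so far.
     */
    int waitForCopy(long taskTimeoutMs) {
        long waitMs = Math.max(1, taskTimeoutMs / 2);
        synchronized (copyResults) {
            while (!copyFinished) {
                try {
                    copyResults.wait(waitMs); // bounded wait instead of wait()
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
                if (!copyFinished) {
                    // Timed out (or woke spuriously) with no copy done yet.
                    sendHeartbeat();
                }
            }
            copyFinished = false; // reset for the next wait
            return heartbeatsSent;
        }
    }
}
```

With a task timeout of 40 ms and a copy that takes about 300 ms, waitForCopy(40) would wake roughly every 20 ms and send a heartbeat each time, rather than staying silent until the copy completes.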
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=comments#action_12437635 ]
Owen O'Malley commented on HADOOP-547:
--------------------------------------
Instead of adding a new timer to the ReduceTaskRunner, I think it would be far easier to have the PingTimer just call reportProgress whenever its progress() method is called.
To get access to the TaskTracker and Task, PingTimer would need to be a non-static inner class instead of a static one.
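The point about a non-static inner class can be illustrated with a small sketch. The names PingTimer, progress() and reportProgress come from the comment above, but their shapes here are assumed, and TaskRunnerSketch with its counter is a hypothetical stand-in for the enclosing class and the tracker.

```java
// Sketch only: class shapes are assumed for illustration, not Hadoop's real code.
class TaskRunnerSketch {
    private int reportCount = 0;

    /** Stand-in for sending a progress report for this task to the tracker. */
    synchronized void reportProgress() {
        reportCount++;
    }

    synchronized int getReportCount() {
        return reportCount;
    }

    /**
     * Non-static inner class: every instance holds an implicit reference to
     * the enclosing TaskRunnerSketch, so it can call reportProgress() directly.
     * A static nested class would need that reference passed in explicitly.
     */
    class PingTimer {
        void progress() {
            reportProgress(); // resolves to TaskRunnerSketch.this.reportProgress()
        }
    }

    PingTimer newPingTimer() {
        return new PingTimer();
    }
}
```

Each call to progress() on a timer obtained from newPingTimer() then results in one reportProgress() call on the runner that created it.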
[jira] Updated: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Sanjay Dahiya updated HADOOP-547:
---------------------------------
Attachment: Hadoop-547.patch
Here is a patch for review.
It makes sure that the reduce task sends a heartbeat/progress report when none of the copy tasks finishes within "mapred.task.timeout". It replaces the unconditional wait with a timed wait of (mapred.task.timeout)/2. (We could make it 3/4 of this timeout as well.)
[jira] Updated: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Sanjay Dahiya updated HADOOP-547:
---------------------------------
Attachment: Hadoop-547_1.patch
Updated patch: it makes PingTimer non-static and sends a progress report in pingTimer.progress().
The progress update interval (1 sec) may be too frequent; it should probably be changed to a higher value in another patch.
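One way to keep reports from going out more often than a chosen interval is a small throttle in front of the report call. This is a sketch of that idea only, not something the patch does: ProgressThrottle, shouldReport() and the 5000 ms interval in the usage note are hypothetical.

```java
// Sketch only: ProgressThrottle is a hypothetical helper, not part of the patch.
class ProgressThrottle {
    private final long minIntervalMs; // minimum spacing between reports
    private long lastReportMs;        // time of the last report that was allowed
    private boolean reportedOnce = false;

    ProgressThrottle(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /**
     * Returns true if a report may be sent at time nowMs and records it as
     * sent; returns false if the previous report was too recent.
     */
    synchronized boolean shouldReport(long nowMs) {
        if (reportedOnce && nowMs - lastReportMs < minIntervalMs) {
            return false;
        }
        reportedOnce = true;
        lastReportMs = nowMs;
        return true;
    }
}
```

With new ProgressThrottle(5000), a caller that checks shouldReport(System.currentTimeMillis()) once per second would send the first report immediately and then at most one report every five seconds.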
[jira] Work started: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Work on HADOOP-547 started by Sanjay Dahiya.
[jira] Commented: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=comments#action_12437669 ]
Owen O'Malley commented on HADOOP-547:
--------------------------------------
This looks good. Try it out and see if it helps.
[jira] Assigned: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Sanjay Dahiya reassigned HADOOP-547:
------------------------------------
Assignee: Sanjay Dahiya
[jira] Updated: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Doug Cutting updated HADOOP-547:
--------------------------------
Status: Resolved (was: Patch Available)
Fix Version/s: 0.7.0
Resolution: Fixed
I just committed this. Thanks, Sanjay!
[jira] Updated: (HADOOP-547) ReduceTaskRunner can miss sending
heartbeats if no map output copy finishes within "mapred.task.timeout"
Posted by "Sanjay Dahiya (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Sanjay Dahiya updated HADOOP-547:
---------------------------------
Status: Patch Available (was: In Progress)