You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/05/24 01:27:29 UTC

[jira] Created: (HADOOP-248) locating map outputs via random probing is inefficient

locating map outputs via random probing is inefficient
------------------------------------------------------

         Key: HADOOP-248
         URL: http://issues.apache.org/jira/browse/HADOOP-248
     Project: Hadoop
        Type: Improvement

  Components: mapred  
    Versions: 0.2.1    
    Reporter: Owen O'Malley
 Assigned to: Owen O'Malley 
     Fix For: 0.3


Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:

class MapLocationResults {
   public int getTimestamp();
   public MapOutputLocation[] getLocations();
}

interface InterTrackerProtocol {
  ...
  MapLocationResults locateMapOutputs(int prevTimestamp);
} 

with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-248:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.12.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Devaraj!

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.12.0
>
>         Attachments: 248-9.patch, 248-fixed1.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475161 ] 

Nigel Daley commented on HADOOP-248:
------------------------------------

Sorry, ignore Hadoop QA's -1.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.12.0
>
>         Attachments: 248-9.patch, 248-fixed1.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467484 ] 

Hadoop QA commented on HADOOP-248:
----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12349617/248-initial8.patch applied and successfully tested against trunk revision r499156.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting reopened HADOOP-248:
---------------------------------


I just reverted this, since it was causing things to hang.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.11.0
>
>         Attachments: 248-9.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465483 ] 

Sameer Paranjpye commented on HADOOP-248:
-----------------------------------------

Sounds good. How do we deal with map failures? Do we get multiple completion events when there are re-tries?  

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-248:
-------------------------------

    Attachment: 248-initial7.patch

The patch does the following:
1) Modifies the protocol version for InterTrackerProtocol since there is a major change there in the way map output is fetched. The method locateMapOutputs has been removed.
2) The getTaskCompletion method in both InterTrackerProtocol and JobSubmission has been changed to take an extra argument - the max no. of events we want to fetch from the JobTracker.
3) Two more fields are added in TaskCompletionEvents - the real-ID portion of the taskId string (for e.g., if the taskId is task_0001_m_000003_0, the real-ID is 3 within the job), and another boolean field to indicate whether the task is a map or not. By the way, these could have done on the Reduce side also by parsing the taskId string, but I think this is a more general way of doing it and it also is in line with the thought of having "ID objects" in the future.
4) Only 10 events are fetched at a time by the JobClient
5) A new value OBSELETE has been added to TaskCompletionEvent.Status to signify lost tasks (which were earlier reported as SUCCEEDED). For this, whenever a FAILED/KILLED TaskStatus is got, it is checked whether a SUCCEEDED was earlier reported for the same taskId, and if so that event is marked as OBSELETE.
6) The number of events probed by the ReduceTaskRunner at any time is equal to max(5*numCopiers, 50).

Feedback appreciated..

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-initial7.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465422 ] 

Devaraj Das commented on HADOOP-248:
------------------------------------

Propose the following:

1) Modify the "TaskCompletionEvent[] getTaskCompletionEvents(String jobid, int fromEventId)" defined in IntertrackerProtocol and JobSubmissionProtocol to have a new argument that will signify how many events we want to fetch. We may get a smaller number depending on how many events got registered for the job. 
So, it becomes: TaskCompletionEvent[] getTaskCompletionEvents(String jobid, int fromEventId, int maxEvents)
This will generally be more scalable. In the case of map-output-fetches, it helps in the way that we do the same thing as we do today (except that the randomness is not there and the TT exactly knows which maps finished).

2) Since the events IDs are numbered in a monotonically increasing sequence for a Job, we don't need to maintain timestamps (as the original comment on this bug suggests).

3) Add a "boolean isMapTask()" method to TaskCompletionEvent class that will return true if the event is from a map task, false otherwise.

Comments?

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-248:
-------------------------------

    Attachment: 248-fixed1.patch

This had a problem introduced unintentionally in the last submission (by Owen, when he corrected the spelling of OBSOLETE, etc.). The problem was that there is a variable called fromEventId which is used to track from which eventId a tasktracker should fetch events from from the jobtracker. This was earlier a IntWritable object, so that set(<somenumber>) could be done on the object and the new value of the 'int' within the object could be seen even when the method invocation returned. This variable was changed to an int and instead "fromEventId += <somenumber>" was done. Unfortunately, this would not be visible when the method invocation returned and hence the TaskTracker would get stuck at a particular eventId and would make no forward progress...
Attached is the new patch which has the IntWritable stuff put back in, and also the method JobClient.listEvents has been modified to take two extra args - fromEventId, numEvents (this method didn't exist when I was earlier working on this issue). The JobSubmissionProtocol version has been changed also to reflect the change in the getTaskCompletionEvents protocol method (missed this in the earlier patch).

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-9.patch, 248-fixed1.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-248:
-------------------------------

    Status: Patch Available  (was: Reopened)

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-9.patch, 248-fixed1.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475160 ] 

Hadoop QA commented on HADOOP-248:
----------------------------------

-1, because the patch command could not apply the latest attachment (http://issues.apache.org/jira/secure/attachment/12351812/248-fixed1.patch) as a patch to trunk revision r510644. Please note that this message is automatically generated and may represent a problem with the automation system and not the patch. Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.12.0
>
>         Attachments: 248-9.patch, 248-fixed1.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-248:
-------------------------------

    Attachment: 248-initial8.patch

This patch does better failure handling. It saves the location of a failed map output fetch in a hashmap for later retrial. Also, it has logic which makes it prefer a newer location to the saved one.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-248:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.11.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Devaraj!

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.11.0
>
>         Attachments: 248-9.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley reassigned HADOOP-248:
------------------------------------

    Assignee: Devaraj Das  (was: Owen O'Malley)

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-248:
-------------------------------

    Status: Patch Available  (was: Open)

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-248?page=all ]

Doug Cutting updated HADOOP-248:
--------------------------------

    Fix Version: 0.5.0
                     (was: 0.4.0)

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>          Key: HADOOP-248
>          URL: http://issues.apache.org/jira/browse/HADOOP-248
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2.1
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.5.0

>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-248?page=all ]

Sameer Paranjpye updated HADOOP-248:
------------------------------------

    Fix Version: 0.4
                     (was: 0.3)

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>          Key: HADOOP-248
>          URL: http://issues.apache.org/jira/browse/HADOOP-248
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2.1
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.4

>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-248:
---------------------------------

    Attachment: 248-9.patch

This is a minor modification of Devaraj's patch that fixes a spelling mistake (OBSELETE) and use TaskInProgress.partition instead of parsing the task id.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-9.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467708 ] 

Hadoop QA commented on HADOOP-248:
----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12349653/248-9.patch applied and successfully tested against trunk revision r500023.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>         Attachments: 248-9.patch, 248-initial7.patch, 248-initial8.patch
>
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465511 ] 

Owen O'Malley commented on HADOOP-248:
--------------------------------------

Sounds good, Devaraj.

The events are per a taskid, not a tipid. So different attempts to run "map 0" would result in different events.

That said, however, we probably should make another event "lost" or something for tasks that are lost because their output had problems or the task tracker was lost.

We may also want to flag the "complete" events of lost tasks as obsolete so that reduces don't see them and try and fetch their outputs.

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira