You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/01/17 12:28:27 UTC
[jira] Commented: (HADOOP-248) locating map outputs via random probing is inefficient

    [ https://issues.apache.org/jira/browse/HADOOP-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465422 ] 

Devaraj Das commented on HADOOP-248:
------------------------------------

Propose the following:

1) Modify the "TaskCompletionEvent[] getTaskCompletionEvents(String jobid, int fromEventId)" defined in IntertrackerProtocol and JobSubmissionProtocol to have a new argument that will signify how many events we want to fetch. We may get a smaller number depending on how many events got registered for the job. 
So, it becomes: TaskCompletionEvent[] getTaskCompletionEvents(String jobid, int fromEventId, int maxEvents)
This will generally be more scalable. In the case of map-output-fetches, it helps in the way that we do the same thing as we do today (except that the randomness is not there and the TT exactly knows which maps finished).

2) Since the events IDs are numbered in a monotonically increasing sequence for a Job, we don't need to maintain timestamps (as the original comment on this bug suggests).

3) Add a "boolean isMapTask()" method to TaskCompletionEvent class that will return true if the event is from a map task, false otherwise.

Comments?

> locating map outputs via random probing is inefficient
> ------------------------------------------------------
>
>                 Key: HADOOP-248
>                 URL: https://issues.apache.org/jira/browse/HADOOP-248
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.2.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>
> Currently the ReduceTaskRunner polls the JobTracker for a random list of map tasks asking for their output locations. It would be better if the JobTracker kept an ordered log and the interface was changed to:
> class MapLocationResults {
>    public int getTimestamp();
>    public MapOutputLocation[] getLocations();
> }
> interface InterTrackerProtocol {
>   ...
>   MapLocationResults locateMapOutputs(int prevTimestamp);
> } 
> with the intention that each time a ReduceTaskRunner calls locateMapOutputs, it passes back the "timestamp" that it got from the previous result. That way, reduces can easily find the new MapOutputs. This should help the "ramp up" when the maps first start finishing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira