Posted to common-dev@hadoop.apache.org by "Ravi Gummadi (JIRA)" <ji...@apache.org> on 2009/04/15 16:51:14 UTC
[jira] Commented: (HADOOP-5572) The map progress value should have a separate phase for doing the final sort.

    [ https://issues.apache.org/jira/browse/HADOOP-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699224#action_12699224 ] 

Ravi Gummadi commented on HADOOP-5572:
--------------------------------------

We are planning to allocate 33% of the map task's progress to the final sort.

Since merge progress is currently not updated (on either the map side or the reduce side), allocating 33% of map task progress to the sort (merge) is not enough by itself: map progress would stay stuck at 66.7% until the sort (merge) finishes and would then jump from 66.7% to 100%. This could affect speculative execution.
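
To make the arithmetic concrete, here is a minimal sketch of how the two phases could combine into one progress value. The class and method names are hypothetical and are not taken from the Hadoop code; the weights of 2/3 and 1/3 are assumed from the 66.7% and 33% figures above.

```java
/** Hypothetical helper, not Hadoop code: combines per-phase progress into one value. */
final class MapTaskProgressSketch {
    // Assumed weights: 2/3 for the map phase, 1/3 for the final sort (merge) phase.
    static final double MAP_WEIGHT = 2.0 / 3.0;
    static final double SORT_WEIGHT = 1.0 / 3.0;

    /** Both arguments are fractions in [0, 1]. */
    static double overall(double mapPhase, double sortPhase) {
        return MAP_WEIGHT * mapPhase + SORT_WEIGHT * sortPhase;
    }

    public static void main(String[] args) {
        // If the sort phase never reports, the task sits at 2/3 and then jumps to 1.
        System.out.println(overall(1.0, 0.0));
        // With merge progress reported, intermediate values become visible.
        System.out.println(overall(1.0, 0.5));
        System.out.println(overall(1.0, 1.0));
    }
}
```

With the sort phase reporting 0 until it completes, the only values ever seen after the map phase are 2/3 and 1, which is the jump described above.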

Here is a proposal for updating sort/merge progress approximately.

In merge(), each merge takes the smallest io.sort.factor files. So, assuming that there is no combiner, we calculate the denominator for mergeProgress as follows before the merges begin:

We maintain a sorted list of the sizes of the segments to be merged. We add up the sizes of the smallest factor segments (the ones that are going to be merged first), remove those sizes from the list, and add their sum to the list. We repeat this until only 1 element is left in the list. This element is the denominator for mergeProgress for the 1st merge.
As the segments are read for a merge, the numerator is incremented based on the position in the segment, and mergeProgress is updated.
The denominator is decreased by the difference (inputRecordsForThisMerge - mergedRecordsInThisMerge). This gives a better approximation of mergeProgress when the combiner is called in merges.
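
The bookkeeping above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Merger code: the class and method names are hypothetical, everything is measured in the same unit as the segment sizes, and the denominator is read as the total amount of data read across all merges. That reading is an assumption on my part, since the single element left in the list always equals the sum of the original sizes (each step replaces segments with their sum).

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of approximate merge progress; not the actual Hadoop implementation. */
final class MergeProgressSketch {
    private long consumed;   // numerator: data read so far across all merges
    private long estimated;  // denominator: estimated data to read across all merges

    MergeProgressSketch(long estimatedTotal) {
        this.estimated = estimatedTotal;
    }

    /** Simulates the merge schedule (smallest `factor` segments first) without a combiner. */
    static long estimateTotalToMerge(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>(segmentSizes); // keeps the smallest first
        long total = 0;
        while (sizes.size() > 1) {
            int n = Math.min(factor, sizes.size());
            long merged = 0;
            for (int i = 0; i < n; i++) {
                merged += sizes.poll();  // remove the smallest sizes
            }
            total += merged;             // this merge reads `merged` units of input
            sizes.add(merged);           // its output becomes a new segment
        }
        return total;
    }

    /** Called as segments are read; delta is the advance in position since the last call. */
    void consumed(long delta) {
        consumed += delta;
    }

    /** Shrinks the denominator when a combiner made the merge output smaller than its input. */
    void mergeFinished(long inputForThisMerge, long mergedInThisMerge) {
        estimated -= (inputForThisMerge - mergedInThisMerge);
    }

    double progress() {
        return estimated <= 0 ? 1.0 : Math.min(1.0, (double) consumed / estimated);
    }

    public static void main(String[] args) {
        // Five segments, factor 3: the first merge reads 10+20+30, the second 40+50+60.
        long total = estimateTotalToMerge(Arrays.asList(10L, 20L, 30L, 40L, 50L), 3);
        MergeProgressSketch p = new MergeProgressSketch(total);
        p.consumed(60);
        p.mergeFinished(60, 30);  // combiner halved the output of the first merge
        System.out.println(p.progress());
    }
}
```

In the example in main(), the estimate starts at 210 and drops to 180 once the first merge turns out to produce 30 instead of 60, since the second merge will then read 120 rather than 150.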

mergeProgress is not very accurate in the above approach (when a combiner is used in merges), for 2 reasons:
(1) The total size of the data that is going to be merged across all the merges cannot be estimated exactly before the merges when a combiner is present.
(2) The sizes of compressed and uncompressed segments (in-memory segments) are treated alike.

This would also avoid the jump in reduce task progress from 33.3% to 66.7%. On the reduce side, when computing the denominator for mergeProgress from the list of segment sizes, we will have to leave out the sizes of the segments in the last merge of factor segments from the estimated total size of data that will be merged, because the last merge is considered part of the 3rd phase of the reduce task (i.e. the reduce phase).
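
For the reduce side, the estimate could be sketched like this, again with hypothetical names and assuming no combiner. The only difference from the map side is that the simulation stops once the remaining segments fit into a single merge, because that last merge is counted under the reduce phase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of the reduce-side estimate; not the actual Hadoop implementation. */
final class ReduceMergeEstimateSketch {
    /** Estimated data read by all merges except the last one, which feeds the reduce phase. */
    static long estimateExcludingLastMerge(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>(segmentSizes); // keeps the smallest first
        long total = 0;
        // Once at most `factor` segments remain, the next merge is the last one: skip it.
        while (sizes.size() > factor) {
            long merged = 0;
            for (int i = 0; i < factor; i++) {
                merged += sizes.poll();
            }
            total += merged;
            sizes.add(merged);
        }
        return total;
    }

    public static void main(String[] args) {
        // Five segments, factor 3: only the first merge (10+20+30) is counted.
        System.out.println(estimateExcludingLastMerge(Arrays.asList(10L, 20L, 30L, 40L, 50L), 3));
    }
}
```

When the number of segments is already at most io.sort.factor, the estimate is 0, since the only merge is the one that belongs to the reduce phase.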

Thoughts?

> The map progress value should have a separate phase for doing the final sort.
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-5572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5572
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Ravi Gummadi
>
> Currently, the final spill and sort doesn't record any progress while it runs, leading to the perception that the map is done, but "stuck".
