Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2008/03/01 10:25:51 UTC

[jira] Created: (HADOOP-2920) Optimize the last merge of the map output files

Optimize the last merge of the map output files
-----------------------------------------------

                 Key: HADOOP-2920
                 URL: https://issues.apache.org/jira/browse/HADOOP-2920
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das


In ReduceTask, today we merge io.sort.factor files at a time, writing the result of each merge back to disk. The last merge can probably be done better. For example, if there are io.sort.factor + 10 files at the end (say 110 files with io.sort.factor = 100), today we will merge 100 files into one and then return an iterator over the remaining 11 files. This can be improved (in terms of disk I/O) by merging the smallest 11 files and then returning an iterator over the remaining 100 files. Another option is to not do any single-level merge when we have io.sort.factor + n files remaining (where n << io.sort.factor) but just return the iterator directly. Thoughts?
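
A minimal sketch of the proposed selection, assuming a hypothetical segment type ordered by a size comparator (the names here are illustrative, not Hadoop's actual ReduceTask code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class LastMergePlan {
    // Pick the smallest extras to pre-merge so that the final
    // on-the-fly merge sees exactly ioSortFactor segments.
    static <S> List<S> segmentsToPreMerge(List<S> segments,
                                          Comparator<S> bySize,
                                          int ioSortFactor) {
        int extra = segments.size() - ioSortFactor;
        if (extra <= 0) {
            return new ArrayList<S>(); // already at or below the factor
        }
        List<S> sorted = new ArrayList<S>(segments);
        Collections.sort(sorted, bySize); // smallest first
        // Merging the (extra + 1) smallest segments into one file
        // leaves exactly ioSortFactor segments for the final pass.
        return sorted.subList(0, extra + 1);
    }
}

With 110 segments and io.sort.factor = 100, this selects the 11 smallest to merge, so the final pass reads exactly 100 streams and the bulk of the data is read from disk only once.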

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2920) Optimize the last merge of the map output files

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574700#action_12574700 ] 

Runping Qi commented on HADOOP-2920:
------------------------------------


When there are io.sort.factor + n + 1 files, merging the smallest n + 2 files should be the right approach: that single merge leaves exactly io.sort.factor segments for the final pass (e.g., with io.sort.factor = 100 and 111 files, merge the smallest 12, leaving 100).




[jira] Commented: (HADOOP-2920) Optimize the last merge of the map output files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574682#action_12574682 ] 

Doug Cutting commented on HADOOP-2920:
--------------------------------------

> Another option is to not do any single-level merge when we have io.sort.factor + n files remaining (where n << io.sort.factor) but just return the iterator directly. Thoughts?

But then when there are io.sort.factor + n + 1 files we'll have to do a merge, so the limit should be io.sort.factor + n + 1. But then, with io.sort.factor + n + 2 files we'll have to merge again, and so on: a soft limit only moves the threshold.

I think io.sort.factor should be a hard limit.  If folks are merging too often, they can increase it by n.

Larger sort factors result in more seeking while merging, so, at some point, a two-level merge becomes faster than a one-level merge. One should aim to set io.sort.factor to that point (although the point moves with buffer sizes). If it's configured correctly, then merging to disk should be faster, or at least no slower, than bumping the merge factor.
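
For what it's worth, raising the factor for a given job is a one-liner; a minimal sketch using the JobConf API (100 is an arbitrary example value, not a recommendation):

import org.apache.hadoop.mapred.JobConf;

public class MergeFactorExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // io.sort.factor caps how many streams one merge pass reads.
        // 100 is an arbitrary example value, not a recommendation.
        conf.setInt("io.sort.factor", 100);
        System.out.println(conf.getInt("io.sort.factor", 10)); // default is 10
    }
}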
