Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2007/01/26 00:23:50 UTC

[jira] Commented: (HADOOP-939) No-sort optimization

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467693 ] 

Owen O'Malley commented on HADOOP-939:
--------------------------------------

I think the complexity of the general case makes this problematic. I wouldn't want to see a config option to do this, because it will be easy for users to get it wrong. 

There are some more specific cases that might be interesting:
  1. After the spill of the map outputs, it would make sense to continue appending to the spill as long as the outputs from the map are sorted. Note that the partition is the primary key for that sort.
  2. The reduces should be scheduled near the map output. That would help in the case where each reduce is getting inputs from a small number of maps.
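
To make point 1 concrete, here is a rough sketch of the check a map task could make before appending to an existing spill. It is not Hadoop code: the Rec class, the compare() helper and appendablePrefix() are all hypothetical names made up for illustration, and keys are assumed to be plain strings. The only thing taken from the comment above is the ordering, with the partition as the primary key and the record key second.

```java
import java.util.Arrays;
import java.util.List;

class SpillAppendSketch {
    /** Hypothetical stand-in for one map output record (not a Hadoop type). */
    static final class Rec {
        final int partition;
        final String key;
        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }
    }

    /** Spill order: partition is the primary key, the record key breaks ties. */
    static int compare(Rec a, Rec b) {
        int c = Integer.compare(a.partition, b.partition);
        return c != 0 ? c : a.key.compareTo(b.key);
    }

    /**
     * Counts how many incoming records can be appended to the current spill
     * without breaking its order. lastSpilled may be null for an empty spill.
     */
    static int appendablePrefix(Rec lastSpilled, List<Rec> incoming) {
        Rec prev = lastSpilled;
        int n = 0;
        for (Rec r : incoming) {
            if (prev != null && compare(prev, r) > 0) {
                break; // out of order: a new sorted run has to start here
            }
            prev = r;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Rec last = new Rec(0, "m");
        List<Rec> incoming = Arrays.asList(
            new Rec(0, "p"), new Rec(1, "a"), new Rec(1, "c"),
            new Rec(0, "z"), new Rec(2, "b"));
        // (0,"z") sorts before (1,"c"), so only the first three can be appended.
        System.out.println(appendablePrefix(last, incoming)); // prints 3
    }
}
```

Note that a record with a smaller partition number ends the run even if its key is larger, which is why a map that emits keys in sorted order still may not produce an appendable stream.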

Note that even if the map outputs are sorted, the reduce still needs to do a merge sort, because the map outputs are fetched in a fairly random order.
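
As an illustration of that merge step, here is a minimal k-way merge over already-sorted runs, using a heap keyed on the head of each run. Again this is only a sketch under assumptions: ReduceMergeSketch, Cursor and merge() are hypothetical names, keys are plain strings, and the runs are in-memory lists rather than fetched map output files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ReduceMergeSketch {
    /** Current head of one sorted run plus an iterator over the rest of it. */
    static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;
        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges sorted runs; the order in which the runs arrive does not matter. */
    static List<String> merge(List<List<String>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // run with the smallest head
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();  // advance, then put the run back
                heap.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each inner list is one sorted map output, listed in arrival order.
        List<List<String>> fetched = Arrays.asList(
            Arrays.asList("d", "k"),
            Arrays.asList("a", "z"),
            Arrays.asList("b", "c", "m"));
        System.out.println(merge(fetched)); // prints [a, b, c, d, k, m, z]
    }
}
```

The merge costs one heap operation per record, so sorted map outputs save the per-map sort but not the reduce-side merge.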

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.