Posted to common-dev@hadoop.apache.org by "arkady borkovsky (JIRA)" <ji...@apache.org> on 2006/11/14 07:49:37 UTC

[jira] Created: (HADOOP-717) When there are few reducers, sorting should be done by mappers

When there are few reducers, sorting should be done by mappers
--------------------------------------------------------------

                 Key: HADOOP-717
                 URL: http://issues.apache.org/jira/browse/HADOOP-717
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: arkady borkovsky


If I understand correctly, sorting currently happens on the reducer side.
So if a few hundred mappers produce a few (or many) gigabytes of data and there is just ONE reducer to consume it, copying and sorting take forever.

It may make sense to have a special-case optimization for a single reducer (e.g. "when there is only one reducer and many mappers, the sort is done by the mappers, and the reducer does only a merge").

Or to have some smarter policy that makes sure sorting uses as many CPUs as makes sense. If the map step has produced data on all the nodes of the cluster, it makes sense to use all the nodes for sorting.
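
The proposal above (mappers sort, the single reducer only merges) can be illustrated with a small standalone sketch. This is not Hadoop code and does not reflect how HADOOP-331 implements it; the function names map_side_sort and reduce_side_merge and the sample records are hypothetical, chosen only to show the idea. Each "mapper" sorts its own output independently (work that can run in parallel on every node), and the "reducer" performs a k-way merge of the already sorted runs instead of a full sort.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# A record is a (key, value) pair; only the key takes part in ordering.
Record = Tuple[str, int]


def map_side_sort(records: Iterable[Record]) -> List[Record]:
    """Hypothetical mapper step: sort this mapper's output by key."""
    return sorted(records, key=lambda kv: kv[0])


def reduce_side_merge(sorted_runs: List[List[Record]]) -> Iterator[Record]:
    """Hypothetical reducer step: k-way merge of runs that are already sorted.

    heapq.merge streams the result lazily and never re-sorts the inputs.
    """
    return heapq.merge(*sorted_runs, key=lambda kv: kv[0])


# Illustrative, made-up output of three mappers (unsorted).
mapper_outputs = [
    [("pear", 1), ("apple", 2)],
    [("fig", 3), ("apple", 4)],
    [("zebra", 5), ("banana", 6)],
]

# Sorting happens once per mapper; the reducer only merges.
runs = [map_side_sort(out) for out in mapper_outputs]
merged = list(reduce_side_merge(runs))
```

With M mappers holding N records in total, the single reducer's work in this sketch drops from a full O(N log N) sort to an O(N log M) merge, while the sorting cost is spread across the mappers.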


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-717) When there are few reducers, sorting should be done by mappers

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-717?page=comments#action_12449581 ] 
            
Devaraj Das commented on HADOOP-717:
------------------------------------

This is handled by HADOOP-331 (work in progress).



        

[jira] Resolved: (HADOOP-717) When there are few reducers, sorting should be done by mappers

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-717?page=all ]

Owen O'Malley resolved HADOOP-717.
----------------------------------

    Fix Version/s: 0.10.0
       Resolution: Fixed

This was fixed by HADOOP-331.

> When there are few reducers, sorting should be done by mappers
> --------------------------------------------------------------
>
>                 Key: HADOOP-717
>                 URL: http://issues.apache.org/jira/browse/HADOOP-717
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: arkady borkovsky
>         Assigned To: Owen O'Malley
>             Fix For: 0.10.0
>
