Posted to common-dev@hadoop.apache.org by "Feng Jiang (JIRA)" <ji...@apache.org> on 2006/11/07 04:32:39 UTC

[jira] Created: (HADOOP-686) job.setOutputValueComparatorClass(theClass) should be supported

job.setOutputValueComparatorClass(theClass) should be supported
---------------------------------------------------------------

                 Key: HADOOP-686
                 URL: http://issues.apache.org/jira/browse/HADOOP-686
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
         Environment: all environment
            Reporter: Feng Jiang


If the input of the Reduce phase is:

K2, V3
K2, V2
K1, V5
K1, V3
K1, V4

then in current Hadoop the reduce output could be:
K1, (V5, V3, V4)
K2, (V3, V2)

But I would like Hadoop to support job.setOutputValueComparatorClass(theClass), so that I can make sure the values are in order, and the output could be:
K1, (V3, V4, V5) 
K2, (V2, V3)

I think this feature is very important. Without it, we have to do the sorting ourselves and worry about the possibility that the values are too large to fit into memory, which makes the code too hard to read. That is why I think this feature should be implemented in the Hadoop framework.
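
For illustration, the manual sorting described above might look like the following minimal sketch. It is plain Java, not the Hadoop API: the class ManualValueSort and the method sortedValues are hypothetical names standing in for the body of a reduce function. Note that it buffers every value for one key in memory, which is exactly the concern raised above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the body of a reduce function; not a Hadoop class.
class ManualValueSort {

    // Buffers all values for one key, then sorts them in natural order.
    // Memory use grows with the number of values per key.
    static <V extends Comparable<V>> List<V> sortedValues(Iterator<V> values) {
        List<V> buffer = new ArrayList<>();
        while (values.hasNext()) {
            buffer.add(values.next());
        }
        Collections.sort(buffer);
        return buffer;
    }

    public static void main(String[] args) {
        // The values for K1 arrive in arbitrary order, as in the example above.
        List<String> k1 = Arrays.asList("V5", "V3", "V4");
        System.out.println("K1, " + sortedValues(k1.iterator()));
        // prints "K1, [V3, V4, V5]"
    }
}
```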



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-686) job.setOutputValueComparatorClass(theClass) should be supported

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-686?page=comments#action_12447725 ] 
            
Runping Qi commented on HADOOP-686:
-----------------------------------

HADOOP-485 allows you to introduce a secondary key for sorting the map outputs, while using only the primary key to group the values passed to reduce.

Applied to the example above, the map outputs carry composite keys whose second elements are the secondary keys.
They are used for sorting, so the sorted data look like:

(K1,3), V3 
(K1,4), V4 
(K1,5), V5 
(K2,2), V2 
(K2,3), V3 

However, only the first elements in the keys are used for grouping. Thus, the following data will be passed to reduce:

(K1), {V3, V4, V5} 
(K2), {V2, V3} 

In general, the user can specify a function f on the values of the map outputs such that, for each key/value pair
(k, v) in the map output, we actually emit
(k,f(v)), v

where (k,f(v)) is used for sorting while k is used for grouping.
The order defined on f(v) should be the desired order on v.
A trivial example is the identity function, f(v) = v. The problem with it is that the values are duplicated in the map output, and sorting on (k, v) may be too expensive. In general, we hope that f(v) yields simple values such as numbers or strings that are much smaller than v, so that sorting on them can be done efficiently.
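
The composite-key idea can be sketched in plain Java as below. This is only a simulation of the sort and group steps under stated assumptions, not the HADOOP-485 implementation and not Hadoop API code: SecondarySortSketch, Record, and sortAndGroup are hypothetical names, and f is assumed here to extract the number that follows the leading "V" of each value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java simulation; none of these names are Hadoop classes.
class SecondarySortSketch {

    // One map output record: composite key (k, f(v)) plus the full value v.
    static final class Record {
        final String key;    // primary key k, used for grouping
        final int sortKey;   // secondary key f(v), used only for ordering
        final String value;  // the value v itself

        Record(String key, int sortKey, String value) {
            this.key = key;
            this.sortKey = sortKey;
            this.value = value;
        }
    }

    // Sorts on (k, f(v)), then groups the sorted records by k alone.
    static Map<String, List<String>> sortAndGroup(
            List<String[]> mapOutput, Function<String, Integer> f) {
        List<Record> records = new ArrayList<>();
        for (String[] kv : mapOutput) {
            records.add(new Record(kv[0], f.apply(kv[1]), kv[1]));
        }
        records.sort(Comparator
                .comparing((Record r) -> r.key)
                .thenComparingInt(r -> r.sortKey));

        // LinkedHashMap keeps the keys in the order the sort produced them.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Record r : records) {
            grouped.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
                new String[] {"K2", "V3"},
                new String[] {"K2", "V2"},
                new String[] {"K1", "V5"},
                new String[] {"K1", "V3"},
                new String[] {"K1", "V4"});
        // Assumed f: the integer after the leading 'V'.
        Function<String, Integer> f = v -> Integer.parseInt(v.substring(1));
        System.out.println(sortAndGroup(mapOutput, f));
        // prints "{K1=[V3, V4, V5], K2=[V2, V3]}"
    }
}
```

Because f(v) is a small integer here, the comparison never touches the full value, which is the efficiency point made above.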

  






[jira] Resolved: (HADOOP-686) job.setOutputValueComparatorClass(theClass) should be supported

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HADOOP-686.
----------------------------------

       Resolution: Duplicate
    Fix Version/s: 0.13.0
         Assignee:     (was: Owen O'Malley)

This was fixed by HADOOP-485.



[jira] Commented: (HADOOP-686) job.setOutputValueComparatorClass(theClass) should be supported

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-686?page=comments#action_12447621 ] 
            
eric baldeschwieler commented on HADOOP-686:
--------------------------------------------

Hacks exist to achieve this effect now. I'll try to scare up an expert to document them. This would also be a good long-term feature.

