Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/02/22 00:46:31 UTC

Sorting output data on value

hi,

Can I sort the output of the reducer based on the value instead of the key?
Also, can I specify that the output should be sorted in decreasing order?

Mapper output -
 <aWord, 1>

Reducer gets-
 <aWord, (1,1,...)>

and outputs -
<aWord, count>

e.g. abc  10
     xyz 100

I want the output to be sorted by value, and in decreasing order -
     xyz 100
     abc  10

Any suggestions ?

thanks,
Taran

Re: Sorting output data on value

Posted by Doug Cutting <cu...@apache.org>.
Tarandeep Singh wrote:
> but isn't the output of reduce step sorted ?

No, the input of reduce is sorted by key.  The output of reduce is 
generally produced as the input arrives, so is generally also sorted by 
key, but reducers can output whatever they like.

Doug
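
A minimal sketch of the point above, in plain Python rather than Hadoop
code: the framework sorts what goes into reduce, not what comes out. The
names `shuffle` and `run_reducer` and the sample counts are hypothetical;
the reducer swaps key and value the way the <count, word> attempt
elsewhere in this thread does.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(map_output):
    """Sort map output by key and group the values, as happens before reduce."""
    ordered = sorted(map_output, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

def run_reducer(grouped):
    """Hypothetical reducer: sum the counts, then emit <count, word>."""
    return [("%06d" % sum(values), word) for word, values in grouped]

# Hypothetical word occurrences: xyz x100, abc x10, pqr x2.
map_output = [("xyz", 1)] * 100 + [("abc", 1)] * 10 + [("pqr", 1)] * 2
reduce_output = run_reducer(shuffle(map_output))

# Records leave the reducer in the order of the *input* keys (abc, pqr, xyz),
# so the new count keys are not in sorted order.
for key, value in reduce_output:
    print(key, value)
```

Nothing re-sorts the records after the reducer emits them, which is why
writing the count as the output key does not by itself give count order.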

Re: Sorting output data on value

Posted by Tarandeep Singh <ta...@gmail.com>.
On Fri, Feb 22, 2008 at 5:46 AM, Owen O'Malley <oo...@yahoo-inc.com> wrote:
>
>  On Feb 21, 2008, at 11:01 PM, Ted Dunning wrote:
>
>  >
>  > But this only guarantees that the results will be sorted within each
>  > reducers input.  Thus, this won't result in getting the results sorted by
>  > the reducers output value.
>
>  I thought the question was how to get the values sorted within a call
>  to reduce. Of course if you are trying to sort the reduce output on a
>  key other than the key that was used coming out of the map, you do
>  need another job.
>

Yes, I need to sort the output coming out of reduce... so the
solution is to run another MR job.

thanks guys for your replies... they were very useful.

-Taran
>  -- Owen
>

Re: Sorting output data on value

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 21, 2008, at 11:01 PM, Ted Dunning wrote:

>
> But this only guarantees that the results will be sorted within each
> reducers input.  Thus, this won't result in getting the results sorted by
> the reducers output value.

I thought the question was how to get the values sorted within a call  
to reduce. Of course if you are trying to sort the reduce output on a  
key other than the key that was used coming out of the map, you do  
need another job.

-- Owen

Re: Sorting output data on value

Posted by Ted Dunning <td...@veoh.com>.
But this only guarantees that the results will be sorted within each
reducer's input.  Thus, this won't result in getting the results sorted by
the reducer's output value.


On 2/21/08 8:40 PM, "Owen O'Malley" <oo...@yahoo-inc.com> wrote:

> 
> On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote:
> 
>> It may be sorted within the output for a single reducer and, indeed, you can
>> even guarantee that it is sorted but *only* by the reduce key.  The order
>> that values appear will not be deterministic.
> 
> Actually, there is a better answer for this. If you put both the
> primary and secondary key into the key, you can use
> JobConf.setOutputValueGroupingComparator to set a comparator that
> only compares the primary key. Reduce will be called once per a
> primary key, but all of the values will be sorted by the secondary key.
> 
> See http://tinyurl.com/32gld4
> 
> -- Owen


Re: Sorting output data on value

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote:

> It may be sorted within the output for a single reducer and, indeed, you can
> even guarantee that it is sorted but *only* by the reduce key.  The order
> that values appear will not be deterministic.

Actually, there is a better answer for this. If you put both the
primary and secondary key into the key, you can use
JobConf.setOutputValueGroupingComparator to set a comparator that
only compares the primary key. Reduce will be called once per
primary key, but all of the values will be sorted by the secondary key.

See http://tinyurl.com/32gld4

-- Owen
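
The composite-key idea above can be modelled in a few lines of plain
Python. This is only a simulation of the sort and grouping behaviour, not
Hadoop code; the record contents and the names `sort_phase` and
`group_phase` are made up for illustration.

```python
from itertools import groupby

# Hypothetical map output: the key is a (primary, secondary) pair.
map_output = [
    (("user2", 3), "c"),
    (("user1", 2), "b"),
    (("user2", 1), "a"),
    (("user1", 1), "d"),
]

def sort_phase(records):
    """The sort compares the full composite key: primary, then secondary."""
    return sorted(records, key=lambda kv: kv[0])

def group_phase(sorted_records):
    """Group on the primary key only, the role the grouping comparator plays."""
    for primary, group in groupby(sorted_records, key=lambda kv: kv[0][0]):
        yield primary, [value for _, value in group]

# One "reduce call" per primary key, values already in secondary-key order.
reduce_calls = list(group_phase(sort_phase(map_output)))
for primary, values in reduce_calls:
    print(primary, values)
```

In a real job the records for one primary key would also have to reach
the same reducer, so the partitioner would need to look at the primary
key only; the sketch sidesteps that by modelling a single reducer.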

Re: Sorting output data on value

Posted by Ted Dunning <td...@veoh.com>.

It may be sorted within the output for a single reducer and, indeed, you can
even guarantee that it is sorted, but *only* by the reduce key.  The order
in which values appear will not be deterministic.

To sort by value, you need to run another MR job with the count from the
first step as the key and the old reducer's output key as the value.  You
will only need an identity mapper.  If you use both the count and the key as
the new key and have an empty value, then you can do a two-level sort in one
step.

Hadoop isn't magic.  If you want something sorted according to a new
ordering, *something* will have to do the work.
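
As a rough sketch of the second job described above (plain Python, a
single simulated reducer, hypothetical function names and data): the
first job is assumed to have already written <count, word> records, so
the map step is an identity and the sort on the new key does all the
work. `reverse=True` stands in for a reversed key comparator; in Hadoop
that would presumably be a custom comparator registered with something
like JobConf.setOutputKeyComparatorClass.

```python
def second_job(count_word_records, descending=True):
    """Simulate a sort-only job over <count, word> records."""
    # Identity map: records pass through unchanged.
    mapped = list(count_word_records)
    # Sort phase: order by the key (the count), optionally reversed.
    return sorted(mapped, key=lambda kv: kv[0], reverse=descending)

def two_level_sort(count_word_records):
    """Use (count, word) as the whole key with an empty value:
    count descending first, then word ascending."""
    keys = [(count, word) for count, word in count_word_records]
    return [(key, None) for key in sorted(keys, key=lambda k: (-k[0], k[1]))]

# Hypothetical output of the first (word count) job.
first_job_output = [(10, "abc"), (100, "xyz"), (2, "pqr")]
by_count = second_job(first_job_output)
for count, word in by_count:
    print(word, count)
```

With more than one reducer each output file would only be sorted within
itself, so a globally ordered result would need either one reducer or a
partitioner that keeps key ranges in order across reducers.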


On 2/21/08 5:38 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> On Thu, Feb 21, 2008 at 5:34 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>>  Use another job step to get the sort done.
>> 
> 
> but isn't the output of reduce step sorted ?
> Also can I specify that sort be done in reverse order ?
> 
>> 
>> 
>>  On 2/21/08 5:11 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>> 
>>> On Thu, Feb 21, 2008 at 3:46 PM, Tarandeep Singh <ta...@gmail.com> wrote:
>>>> hi,
>>>> 
>>>>  Can I sort the output of reducer based on the value instead of key.
>>>>  Also can I specify that the output should be sorted in decreasing order ?
>>>> 
>>>>  Mapper output -
>>>>   <aWord, 1>
>>>> 
>>>>  Reducer gets-
>>>>   <aWord, (1,1,...)>
>>>> 
>>>>  and outputs -
>>>>  <aWord, count>
>>>> 
>>>>  e.g abc 10
>>>>       xyz  100
>>>> 
>>>>  I want the output to be sorted based on the value and that too in
>>>>  decreasing order -
>>>>      xyz 100
>>>>      abc  10
>>>> 
>>>>  Any suggestions ?
>>>> 
>>> 
>>> I set the output format to Text and then converted the count into text
>>> and wrote this as key and the aWord as value. I was expecting an
>>> output sorted on the count now but it didn't work that way ? Could
>>> anyone explain why so ?
>>> 
>>> reducer output -
>>>   <000001, abc>
>>>   <000005, xyz>
>>>   <000002, pqr>
>>> 
>>> thanks,
>>> Taran
>>> 
>>> 
>>>>  thanks,
>>>>  Taran
>>>> 
>> 
>> 


Re: Sorting output data on value

Posted by Tarandeep Singh <ta...@gmail.com>.
On Thu, Feb 21, 2008 at 5:34 PM, Ted Dunning <td...@veoh.com> wrote:
>
>  Use another job step to get the sort done.
>

But isn't the output of the reduce step sorted?
Also, can I specify that the sort be done in reverse order?

>
>
>  On 2/21/08 5:11 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>
>  > On Thu, Feb 21, 2008 at 3:46 PM, Tarandeep Singh <ta...@gmail.com> wrote:
>  >> hi,
>  >>
>  >>  Can I sort the output of reducer based on the value instead of key.
>  >>  Also can I specify that the output should be sorted in decreasing order ?
>  >>
>  >>  Mapper output -
>  >>   <aWord, 1>
>  >>
>  >>  Reducer gets-
>  >>   <aWord, (1,1,...)>
>  >>
>  >>  and outputs -
>  >>  <aWord, count>
>  >>
>  >>  e.g abc 10
>  >>       xyz  100
>  >>
>  >>  I want the output to be sorted based on the value and that too in
>  >>  decreasing order -
>  >>      xyz 100
>  >>      abc  10
>  >>
>  >>  Any suggestions ?
>  >>
>  >
>  > I set the output format to Text and then converted the count into text
>  > and wrote this as key and the aWord as value. I was expecting an
>  > output sorted on the count now but it didn't work that way ? Could
>  > anyone explain why so ?
>  >
>  > reducer output -
>  >   <000001, abc>
>  >   <000005, xyz>
>  >   <000002, pqr>
>  >
>  > thanks,
>  > Taran
>  >
>  >
>  >>  thanks,
>  >>  Taran
>  >>
>
>

Re: Sorting output data on value

Posted by Ted Dunning <td...@veoh.com>.
Use another job step to get the sort done.

On 2/21/08 5:11 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> On Thu, Feb 21, 2008 at 3:46 PM, Tarandeep Singh <ta...@gmail.com> wrote:
>> hi,
>> 
>>  Can I sort the output of reducer based on the value instead of key.
>>  Also can I specify that the output should be sorted in decreasing order ?
>> 
>>  Mapper output -
>>   <aWord, 1>
>> 
>>  Reducer gets-
>>   <aWord, (1,1,...)>
>> 
>>  and outputs -
>>  <aWord, count>
>> 
>>  e.g abc 10
>>       xyz  100
>> 
>>  I want the output to be sorted based on the value and that too in
>>  decreasing order -
>>      xyz 100
>>      abc  10
>> 
>>  Any suggestions ?
>> 
> 
> I set the output format to Text and then converted the count into text
> and wrote this as key and the aWord as value. I was expecting an
> output sorted on the count now but it didn't work that way ? Could
> anyone explain why so ?
> 
> reducer output -
>   <000001, abc>
>   <000005, xyz>
>   <000002, pqr>
> 
> thanks,
> Taran
> 
> 
>>  thanks,
>>  Taran
>> 


Re: Sorting output data on value

Posted by Tarandeep Singh <ta...@gmail.com>.
On Thu, Feb 21, 2008 at 3:46 PM, Tarandeep Singh <ta...@gmail.com> wrote:
> hi,
>
>  Can I sort the output of reducer based on the value instead of key.
>  Also can I specify that the output should be sorted in decreasing order ?
>
>  Mapper output -
>   <aWord, 1>
>
>  Reducer gets-
>   <aWord, (1,1,...)>
>
>  and outputs -
>  <aWord, count>
>
>  e.g abc 10
>       xyz  100
>
>  I want the output to be sorted based on the value and that too in
>  decreasing order -
>      xyz 100
>      abc  10
>
>  Any suggestions ?
>

I set the output format to Text, converted the count into text, and
wrote it as the key with aWord as the value. I was expecting the
output to be sorted on the count now, but it didn't work that way.
Could anyone explain why?

reducer output -
  <000001, abc>
  <000005, xyz>
  <000002, pqr>

thanks,
Taran


>  thanks,
>  Taran
>