Posted to common-user@hadoop.apache.org by Teodor Macicas <te...@epfl.ch> on 2010/08/24 11:21:39 UTC

Hadoop sorting algorithm on equal keys

Hello,

Let's say that we have two map outputs that will be sorted before the 
reducer starts. It doesn't matter what {a,b0,b1,c} mean, but let's 
assume that b0=b1.
Map output 1: a, b0
Map output 2: c, b1
In this case we can have 2 different sets of sorted data:
1. {a,b0,b1,c}  and
2. {a,b1,b0,c}, since b0=b1.

In my particular problem I want to distinguish between b0 and b1. 
Basically, they are numbers, but I have extra info on which my comparison 
will be made.
Now, the question is: how can I change Hadoop's default behaviour in order 
to control the sorting algorithm on equal keys?

Thank you in advance.
Best,
Teodor

Re: Hadoop sorting algorithm on equal keys

Posted by Owen O'Malley <om...@apache.org>.
On Aug 24, 2010, at 2:21 AM, Teodor Macicas wrote:

> Hello,
>
> Let's say that we have two map outputs that will be sorted before  
> the reducer starts. It doesn't matter what {a,b0,b1,c} mean, but  
> let's assume that b0=b1.
> Map output 1: a, b0
> Map output 2: c, b1
> In this case we can have 2 different sets of sorted data:
> 1. {a,b0,b1,c}  and
> 2. {a,b1,b0,c}, since b0=b1.
>
> In my particular problem I want to distinguish between b0 and b1.  
> Basically, they are numbers, but I have extra info on which my  
> comparison will be made.
> Now, the question is: how can I change Hadoop's default behaviour in  
> order to control the sorting algorithm on equal keys?

You need to extend the keys with the extra information to sort on. To  
still get exactly one call to reduce for each logical key, you define a  
grouping comparator that determines when two keys should go to distinct  
calls to reduce. Look at the SecondarySort example in MapReduce: http://bit.ly/a9B7hh

-- Owen
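
The composite-key idea above can be sketched in plain Java without any Hadoop classes. This is only a simulation under stated assumptions: CompositeKey, SORT_ORDER, GROUPING and sortAndGroup are hypothetical names, the keys are assumed to be ints, and the extra info is assumed to be an int tie-breaker. It shows that sorting on (natural key, extra info) fixes the order of b0 and b1, while grouping on the natural key alone still yields one group per logical key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the composite-key idea; no Hadoop classes are used.
final class SecondarySortSketch {

    // Hypothetical composite key: the natural key plus the extra info
    // that breaks ties between equal natural keys.
    static final class CompositeKey {
        final int natural;
        final int extra;

        CompositeKey(int natural, int extra) {
            this.natural = natural;
            this.extra = extra;
        }

        @Override
        public String toString() {
            return "(" + natural + "," + extra + ")";
        }
    }

    // Sort order: natural key first, then the extra info.
    static final Comparator<CompositeKey> SORT_ORDER =
        Comparator.comparingInt((CompositeKey k) -> k.natural)
                  .thenComparingInt(k -> k.extra);

    // Grouping: looks only at the natural key, so keys that differ
    // only in the extra info land in the same simulated reduce call.
    static final Comparator<CompositeKey> GROUPING =
        Comparator.comparingInt((CompositeKey k) -> k.natural);

    // Sorts the keys, then cuts the sorted run into groups,
    // one group per simulated call to reduce.
    static List<List<CompositeKey>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        List<List<CompositeKey>> groups = new ArrayList<>();
        for (CompositeKey key : sorted) {
            boolean startsNewGroup = groups.isEmpty()
                || GROUPING.compare(groups.get(groups.size() - 1).get(0), key) != 0;
            if (startsNewGroup) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // a = (1,0), b0 = (2,0), b1 = (2,1), c = (3,0), in arbitrary arrival order.
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey(3, 0), new CompositeKey(2, 1),
            new CompositeKey(1, 0), new CompositeKey(2, 0));
        // Prints [[(1,0)], [(2,0), (2,1)], [(3,0)]]
        System.out.println(sortAndGroup(keys));
    }
}
```

In a real job the two comparators would be registered on the job rather than called by hand: Job.setSortComparatorClass and Job.setGroupingComparatorClass in the org.apache.hadoop.mapreduce API, or JobConf.setOutputKeyComparatorClass and JobConf.setOutputValueGroupingComparator in the older org.apache.hadoop.mapred API. A partitioner that looks only at the natural key is also needed so that b0 and b1 reach the same reducer.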

Re: Hadoop sorting algorithm on equal keys

Posted by Gang Luo <lg...@yahoo.com.cn>.
It seems you have to insert a tag in the map output tuple that tells where the 
tuple comes from. On the reduce side, you write your own sort that takes the tag into account. 


-Gang
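
The tagging approach above can also be sketched in plain Java. Again this is only a simulation under assumptions: Tagged, sourceTag, payload and orderByTag are hypothetical names, and the tag is assumed to be an int identifying the map output a value came from. The method stands in for the body of one reduce call: it buffers the values of a single key and sorts them by tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of reduce-side sorting by a source tag; no Hadoop classes are used.
final class TaggedValueSortSketch {

    // Hypothetical tagged value: the payload plus a tag naming its source.
    static final class Tagged {
        final int sourceTag;
        final String payload;

        Tagged(int sourceTag, String payload) {
            this.sourceTag = sourceTag;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return payload + "@" + sourceTag;
        }
    }

    // Stand-in for the body of one reduce call: copy the values of a
    // single key into memory, then order them by their source tag.
    static List<Tagged> orderByTag(Iterable<Tagged> valuesForOneKey) {
        List<Tagged> buffered = new ArrayList<>();
        for (Tagged value : valuesForOneKey) {
            buffered.add(value);
        }
        // List.sort is stable, so values with equal tags keep their arrival order.
        buffered.sort(Comparator.comparingInt((Tagged t) -> t.sourceTag));
        return buffered;
    }

    public static void main(String[] args) {
        // b1 came from map output 2 and happens to arrive before b0 from map output 1.
        List<Tagged> arrived = Arrays.asList(new Tagged(2, "b1"), new Tagged(1, "b0"));
        // Prints [b0@1, b1@2]
        System.out.println(orderByTag(arrived));
    }
}
```

The trade-off is that all values of one key are held in memory inside the reducer, which is fine for small groups but can be a problem for keys with very many values. Extending the key and letting the framework sort avoids that buffering.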




----- Original Message ----
From: Teodor Macicas <te...@epfl.ch>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: 2010/8/24 (Tue) 5:21:39 AM
Subject: Hadoop sorting algorithm on equal keys

Hello,

Let's say that we have two map outputs that will be sorted before the reducer 
starts. It doesn't matter what {a,b0,b1,c} mean, but let's assume that b0=b1.
Map output 1: a, b0
Map output 2: c, b1
In this case we can have 2 different sets of sorted data:
1. {a,b0,b1,c}  and
2. {a,b1,b0,c}, since b0=b1.

In my particular problem I want to distinguish between b0 and b1. Basically, they 
are numbers, but I have extra info on which my comparison will be made.
Now, the question is: how can I change Hadoop's default behaviour in order to 
control the sorting algorithm on equal keys?

Thank you in advance.
Best,
Teodor