
Posted to common-user@hadoop.apache.org by Rui Shi <sh...@yahoo.com> on 2007/12/04 02:10:11 UTC

Multiple keys

Hi,

I need to sort the data by multiple keys. Is there any built-in support in Hadoop? 

Thanks,

Rui




Re: Multiple keys

Posted by Arun C Murthy <ar...@yahoo-inc.com>.

Rui Shi wrote:
> Hi,
> 
> I need to sort the data by multiple keys. Is there any built-in support in Hadoop? 
> 

Rui, could you sketch the exact task on hand for us?

Generally, the idea is to make the map-output key a _complex_ (composite)
key and define the comparators needed to sort by multiple keys.

E.g.

Map-input: <K1, V1>
Map-output: <(K2, K3), V2>
Reduce-output: <K4, V3>

So, as long as you have the necessary comparator defined for (K2, K3) 
you are golden.
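
As a rough illustration of what a comparator for (K2, K3) could look like, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes so that it runs on its own; in a real job the key would normally be a Hadoop WritableComparable. The names PairKey, BY_K2_THEN_K3 and sortKeys, and the choice of a String for K2 and an int for K3, are assumptions made for the example and are not from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CompositeKeySketch {

    // Hypothetical composite key standing in for the (K2, K3) map-output key.
    static final class PairKey {
        final String k2;
        final int k3;

        PairKey(String k2, int k3) {
            this.k2 = k2;
            this.k3 = k3;
        }

        @Override
        public String toString() {
            return "(" + k2 + ", " + k3 + ")";
        }
    }

    // Comparator for the composite key: order by K2, break ties with K3.
    static final Comparator<PairKey> BY_K2_THEN_K3 =
            Comparator.comparing((PairKey key) -> key.k2)
                      .thenComparingInt(key -> key.k3);

    // Returns a sorted copy, leaving the input list untouched.
    static List<PairKey> sortKeys(List<PairKey> keys) {
        List<PairKey> copy = new ArrayList<>(keys);
        copy.sort(BY_K2_THEN_K3);
        return copy;
    }

    public static void main(String[] args) {
        List<PairKey> keys = Arrays.asList(
                new PairKey("b", 1), new PairKey("a", 7), new PairKey("a", 2));
        System.out.println(sortKeys(keys)); // [(a, 2), (a, 7), (b, 1)]
    }
}
```

The point of the sketch is only the ordering rule: the second field matters solely when the first fields compare equal.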

Does that work for you?

Arun

> Thanks,
> 
> Rui

Re: Multiple keys

Posted by Ted Dunning <td...@veoh.com>.

There is the largely undocumented record stream stuff.  You define your
records in an IDL-like language that compiles to Java code.  I haven't used
it, but it doesn't look particularly hard.

I believe that this stuff includes definitions of comparators.

Also, if you just put concatenated keys into the key that is output from the
mapper, you effectively get multi-key sorting.
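
One caveat with concatenated keys is that plain strings sort lexicographically, so numeric fields need padding and the separator has to sort below the field contents. The following plain-Java sketch shows one way this might be done. The helper names concatKey and sortedKeys, the tab separator, and the 10-digit zero padding are all assumptions for illustration; it also assumes the numeric field is non-negative and the first field contains no tab or lower control characters.

```java
import java.util.Arrays;
import java.util.Locale;

public class ConcatKeySketch {

    // Hypothetical helper: builds one sortable string from two fields.
    // Assumes 'secondary' is non-negative and 'primary' contains no tab
    // or lower control characters.
    static String concatKey(String primary, int secondary) {
        // Zero-padding makes string order agree with numeric order.
        return primary + "\t" + String.format(Locale.ROOT, "%010d", secondary);
    }

    // Returns a sorted copy using plain lexicographic String order.
    static String[] sortedKeys(String... keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortedKeys(
                concatKey("b", 2), concatKey("a", 10), concatKey("a", 9));
        for (String key : sorted) {
            System.out.println(key);
        }
    }
}
```

Without the padding, "a\t10" would sort before "a\t9", which is usually not what a multi-key sort is meant to produce.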

If you really mean that you want to sort the values that your reduce
functions get, that is also possible.  The trick is that you need to define
a key that includes both the partitioning data (to determine which records
get grouped together for reducing) and the sort key (to determine what order
the reduce sees the data in).  This means that you have to define two
functions in your job config.  I don't have sample code just off-hand for
this, but it isn't hard to figure out from the javadocs.
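
To make the two roles concrete, here is a small plain-Java simulation of the idea, not Hadoop code. It assumes the "two functions" are a partitioning function that looks only at the grouping part of the key and a sort comparator that looks at the whole key; that reading, along with the names Key, partitionFor, SORT_ORDER and shuffle, is an assumption for the sketch rather than something stated in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondarySortSketch {

    // Hypothetical key: 'group' is the partitioning data, 'order' the sort key.
    static final class Key {
        final String group;
        final int order;

        Key(String group, int order) {
            this.group = group;
            this.order = order;
        }

        @Override
        public String toString() {
            return group + ":" + order;
        }
    }

    // Function 1: pick a partition from the grouping part only, so every
    // record of a group ends up in the same partition.
    static int partitionFor(Key key, int numPartitions) {
        return (key.group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Function 2: sort on the whole key, group first and then the sort field.
    static final Comparator<Key> SORT_ORDER =
            Comparator.comparing((Key key) -> key.group)
                      .thenComparingInt(key -> key.order);

    // Simulates the shuffle: route each key to its partition, then sort
    // the contents of every partition.
    static Map<Integer, List<Key>> shuffle(List<Key> keys, int numPartitions) {
        Map<Integer, List<Key>> partitions = new TreeMap<>();
        for (Key key : keys) {
            partitions.computeIfAbsent(partitionFor(key, numPartitions),
                                       p -> new ArrayList<>()).add(key);
        }
        for (List<Key> partition : partitions.values()) {
            partition.sort(SORT_ORDER);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("u1", 5), new Key("u2", 3), new Key("u1", 1));
        System.out.println(shuffle(keys, 1)); // {0=[u1:1, u1:5, u2:3]}
    }
}
```

Because the partition depends only on the group field, changing the sort field never moves a record to a different partition; it only changes where the record appears within it.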

On 12/3/07 5:10 PM, "Rui Shi" <sh...@yahoo.com> wrote:

> Hi,
> 
> I need to sort the data by multiple keys. Is there any built-in support in
> Hadoop? 
> 
> Thanks,
> 
> Rui