Posted to common-user@hadoop.apache.org by Le Zhao <le...@cs.cmu.edu> on 2010/01/15 19:30:11 UTC

secondary sort implementation?

Hi All - I'm wondering if there is any justification for the current 
implementation of secondary sort.

The current implementation requires one to duplicate the second key in 
both the key and the value fields, sorting on both keys and merging on 
the first key.  That costs some extra storage and network traffic.
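For concreteness, here is a minimal plain-Java sketch of that pattern. 
It uses no Hadoop classes; the names CompositeKeySketch, Rec and 
sortThenGroup are made up for illustration, and the "second:payload" 
value format is only an assumption about how the duplication might look.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the composite-key pattern (not Hadoop code).
final class CompositeKeySketch {
    // Hypothetical record: (first, second) form the composite sort key,
    // and the value repeats the second key so it survives the grouping.
    static final class Rec {
        final String first;
        final int second;
        final String value;

        Rec(String first, int second, String value) {
            this.first = first;
            this.second = second;
            this.value = value;
        }
    }

    // Sort on both keys at once, then group ("merge") on the first key only.
    static Map<String, List<String>> sortThenGroup(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Rec r) -> r.first)
                              .thenComparingInt(r -> r.second));
        // LinkedHashMap keeps groups in the order the sorted records arrive.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Rec r : sorted) {
            groups.computeIfAbsent(r.first, k -> new ArrayList<>()).add(r.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> input = List.of(
            new Rec("b", 2, "2:y"), new Rec("a", 3, "3:z"),
            new Rec("a", 1, "1:x"), new Rec("b", 1, "1:w"));
        System.out.println(sortThenGroup(input));
    }
}
```

Note that every record carries the second key twice: once in the sort 
key and once inside the value string.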

An alternative is to sort on the first key and then, before feeding 
values to the reducer, sort the values (for each unique first key) on 
the second key.
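A rough plain-Java sketch of what I mean, assuming each group's values 
fit in memory.  GroupThenSortSketch, Val and groupThenSort are 
hypothetical names, not anything in Hadoop; the second key is stored 
only once, inside the value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of "sort on first key, then sort each group".
final class GroupThenSortSketch {
    // Hypothetical value type: the second key lives only in the value.
    static final class Val {
        final int second;
        final String payload;

        Val(int second, String payload) {
            this.second = second;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return second + ":" + payload;
        }
    }

    // Step 1: collect values under their first key (TreeMap keeps the
    // first keys sorted). Step 2: sort each group's values on the second key.
    static Map<String, List<Val>> groupThenSort(List<Map.Entry<String, Val>> input) {
        Map<String, List<Val>> groups = new TreeMap<>();
        for (Map.Entry<String, Val> e : input) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        for (List<Val> values : groups.values()) {
            values.sort(Comparator.comparingInt((Val v) -> v.second));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Val>> input = List.of(
            Map.entry("b", new Val(2, "y")), Map.entry("a", new Val(3, "z")),
            Map.entry("a", new Val(1, "x")), Map.entry("b", new Val(1, "w")));
        System.out.println(groupThenSort(input));
    }
}
```

The per-group sort here holds a whole group in a list, so this only 
illustrates the ordering, not how a framework would handle groups that 
are too large for memory.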

This not only saves space but also saves sorting time: with n unique 
first keys and m values per key, sorting n*m elements (on two keys at 
the same time) will take more time than sorting n elements and then, 
for each of the n, sorting m elements.
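A back-of-the-envelope check of that claim, assuming a comparison sort 
costs about N*log2(N) comparisons and that there are n unique first 
keys with m values each.  SortCostSketch and its method names are 
hypothetical, and the estimate deliberately ignores the cost of 
bringing records with the same first key together, as well as any 
shuffle or merge overhead, so treat it as rough arithmetic only.

```java
// Rough comparison-count estimates; not a benchmark of any real sort.
final class SortCostSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Sort all n*m records at once on the composite (first, second) key.
    static double compositeCost(long n, long m) {
        double total = (double) n * (double) m;
        return total * log2(total);
    }

    // Sort the n first keys, then sort m values within each of the n groups.
    static double twoPhaseCost(long n, long m) {
        return n * log2(n) + (double) n * m * log2(m);
    }

    public static void main(String[] args) {
        long n = 1000;
        long m = 100;
        System.out.printf("composite: %.0f%n", compositeCost(n, m));
        System.out.printf("two-phase: %.0f%n", twoPhaseCost(n, m));
    }
}
```

Under these assumptions the gap between the two estimates works out to 
n*(m-1)*log2(n) comparisons.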

Is there a reason against this alternative for secondary sort?

Thanks,
Le