Posted to common-user@hadoop.apache.org by Tim Wintle <ti...@teamrubber.com> on 2009/03/11 21:25:45 UTC

Re: Why is a large number of [(heavy) key, (light) value] pairs faster than [(light) key, (heavy) value]?

On Tue, 2009-03-10 at 19:44 -0700, Gyanit wrote:
> I have a large number of (key, value) pairs, and I don't actually care
> whether the data goes in the value or the key. To be more exact:
> the number of (k, v) pairs after the combiner is about 1 million, with
> approx 1 KB of data per pair, which I can put in either the keys or
> the values.
> I have experimented with both options, (heavy key, light value) vs.
> (light key, heavy value), and it turns out that the (hk, lv) option is
> much, much better than (lk, hv).
<snip>
> There is a difference in the time taken by the shuffle phase, which is
> weird, as the amount of data transferred is the same.

Just an idea, but could this be related to the hash function used for
partitioning? Are the same number of reducers being used no matter
which layout you pick?
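
For reference, a minimal sketch of what the default partitioner does
(this roughly mirrors Hadoop's HashPartitioner, if I remember right;
the class name below is mine, not the real one). The point is that the
reducer a record goes to depends only on the key's hashCode(), so
moving the 1 KB payload between key and value changes how records
spread across reducers:

// Sketch, roughly what o.a.h.mapred.lib.HashPartitioner does:
// the target reducer depends only on the key's hashCode().
import org.apache.hadoop.mapreduce.Partitioner;

public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // mask off the sign bit, then mod by the reducer count
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}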

As I understand it, the reducers merge-sort their share of the data
while the shuffle is still happening, so if each reducer ends up having
to sort less data, that could be part of the difference.
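
One way to rule the reducer count out would be to pin it explicitly and
rerun both layouts. A hypothetical driver snippet using the old mapred
API (MyJob and the value 16 are placeholders, not from your setup):

// Hypothetical driver: fix the reducer count so both the (hk,lv) and
// (lk,hv) runs shuffle into the same number of reduce-side merge-sorts.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);
    conf.setJobName("kv-layout-test");
    conf.setNumReduceTasks(16); // same value for both runs
    // ... mapper/reducer/input/output setup elided ...
    JobClient.runJob(conf);
  }
}

If the timings still differ with the reducer count held constant, the
partitioning skew explanation gets weaker.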