You are viewing a plain text version of this content. The canonical link for it is here.

Posted to general@hadoop.apache.org by Ren Zuocheng <aw...@gmail.com> on 2010/11/23 04:29:59 UTC

How does Hadoop do sort(shuffle) after map exactly?

Hi, 
I'm new to Hadoop and I want to know its implementation better. I was always wondering after mapping, how each reduce task get its input. It is said in google's paper and hadoop's documentation that a sort is done to aggregate the same key of the map output. But there is no detailed explanation of how it is implemented and my intuition is that perhaps a global hashing will work better than sorting. So I really want to know the details and see whether my intuition is right. If I can find out that in the source code, where should I start with?

Sent from my iPhone

Re: How does Hadoop do sort(shuffle) after map exactly?

Posted by Harsh J <qw...@gmail.com>.

Hi,

On Tue, Nov 23, 2010 at 8:59 AM, Ren Zuocheng <aw...@gmail.com> wrote:
> Hi,
> I'm new to Hadoop and I want to know its implementation better. I was always wondering after mapping, how each reduce task get its input. It is said in google's paper and hadoop's documentation that a sort is done to aggregate the same key of the map output. But there is no detailed explanation of how it is implemented and my intuition is that perhaps a global hashing will work better than sorting. So I really want to know the details and see whether my intuition is right. If I can find out that in the source code, where should I start with?
>

Read the MapTask and ReduceTask classes, and you'll find how the
Shuffle is implemented within.

The whole process is like this if am right:
MapTask                                                         || ReduceTask
Map Emit -> Partition -> (Combine) Sort -> Spill || Shuffle (Combine)
-> Merge -> Reduce

-- 
Harsh J
www.harshj.com