Posted to mapreduce-user@hadoop.apache.org by Allen <an...@gmail.com> on 2011/11/02 04:07:11 UTC

sort and merge in map/reduce

Hello,

I am curious about what happens after the map puts a key/value pair
into the collector. I know that a spill and a sort/merge happen, but I
don't have a clear picture. My understanding is that a partitioner
divides the key/value pairs (the map output) into several "groups",
each of which will be sent to a particular reducer. For each "group",
the MapTask sorts the key/value pairs by key (why???) and materializes
them on local disk. I don't know where the merge comes in or why we
need it.
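
To make the question concrete, here is a toy Python sketch of the
pipeline as I picture it. It is only a model of my understanding, not
Hadoop's actual MapTask code: the names partition, spill and map_side,
the buffer_size parameter and the character-sum partitioner are all
made up for illustration. The merge at the end is my guess at where it
fits, namely combining several sorted spills into one sorted run per
reducer.

```python
import heapq
from collections import defaultdict


def partition(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner. Summing code points
    # keeps the result deterministic from run to run.
    return sum(ord(c) for c in key) % num_reducers


def spill(buffer, num_reducers):
    # Split one full in-memory buffer into per-reducer groups and sort
    # each group by key, as if it were being written to local disk.
    groups = defaultdict(list)
    for key, value in buffer:
        groups[partition(key, num_reducers)].append((key, value))
    return {p: sorted(pairs) for p, pairs in groups.items()}


def map_side(records, num_reducers, buffer_size):
    # Collect map output into a bounded buffer; spill whenever it fills.
    spills, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= buffer_size:
            spills.append(spill(buffer, num_reducers))
            buffer = []
    if buffer:
        spills.append(spill(buffer, num_reducers))

    # Assumed merge step: each spill is already sorted per partition,
    # so a k-way merge yields one sorted run for every reducer.
    merged = {}
    for p in range(num_reducers):
        runs = [s[p] for s in spills if p in s]
        merged[p] = list(heapq.merge(*runs))
    return merged
```

In this model, calling map_side(records, 2, 2) on five records
produces three spills, and the merge is the only step that turns them
back into a single sorted run per reducer.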

On the reduce side, there is also a sort and merge step. Why is that necessary?

Thanks for helping me.

-- 
Allen