Posted to common-user@hadoop.apache.org by Mithila Nagendra <mn...@asu.edu> on 2009/03/07 21:03:49 UTC

Does reduce start only after the map is completed?

Hey all,
I'm using Hadoop version 0.18.3 and was wondering whether the reduce phase
starts only after the mapping is completed. Is it required that the map
phase be 100% done, or can it be programmed in such a way that the reduce
starts earlier?

Thanks!
Mithila Nagendra
Arizona State University

Re: Does reduce start only after the map is completed?

Posted by Tim Wintle <ti...@teamrubber.com>.
On Sat, 2009-03-07 at 23:03 +0300, Mithila Nagendra wrote:
> Hey all,
> I'm using Hadoop version 0.18.3 and was wondering whether the reduce phase
> starts only after the mapping is completed. Is it required that the map
> phase be 100% done, or can it be programmed in such a way that the reduce
> starts earlier?

As I understand it, the reducers have three phases:

 1) Copy Data from the mappers ("Shuffle")
 2) Sort the data on the reducer (by key)
 3) Actually run the data through the function you've defined.

<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Reducer.html>

The Reducer tasks/processes start as soon as they are able to (I
believe), and copying data and sorting happens while there may still be
mappers running.

Stage (3) cannot run until stage (2) has completed, which obviously
cannot happen until all the mappers are complete, since the final sort
needs the output of every mapper.
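
A minimal sketch of the three stages in plain Python (this is not Hadoop
code; `run_reducer`, `reduce_fn` and the event strings are hypothetical names
used only for illustration). It assumes the map outputs arrive in the order
the maps finish, and it simplifies by doing one sort after the last copy,
whereas a real reducer may merge already-copied segments while maps are still
running. The point it shows is the ordering: copying can begin per map, but
the user-defined reduce function only runs once every map's output is in.

```python
from collections import defaultdict


def run_reducer(map_outputs_in_completion_order, reduce_fn):
    """Simulate one reducer: copy as maps finish, then sort, then reduce.

    map_outputs_in_completion_order: list of (map_id, [(key, value), ...])
    reduce_fn: callable(key, values) -> result for that key
    """
    copied = []   # stage 1 buffer: pairs fetched from finished maps
    events = []   # log of what happened, in order
    total_maps = len(map_outputs_in_completion_order)

    # Stage 1: copy ("shuffle"). One fetch per map, as each map finishes.
    for done, (map_id, pairs) in enumerate(map_outputs_in_completion_order, start=1):
        copied.extend(pairs)
        events.append(f"copied output of {map_id} ({done}/{total_maps} maps done)")

    # Stage 2: sort by key. This can only finish after the last copy.
    copied.sort(key=lambda kv: kv[0])
    events.append("sorted")

    # Stage 3: group values by key and call the user-defined function.
    grouped = defaultdict(list)
    for key, value in copied:
        grouped[key].append(value)
    results = {key: reduce_fn(key, values) for key, values in grouped.items()}
    events.append("reduced")
    return results, events
```

For example, calling `run_reducer` with the outputs of three maps and
`lambda key, values: sum(values)` as the reduce function logs three copy
events before the single "sorted" and "reduced" events.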

In my experience, I haven't found this a major issue (especially if
there are many times more mappers than machines), since the shuffle and
sort stages take significant time and effort anyway.


Tim Wintle


Re: Does reduce start only after the map is completed?

Posted by pa...@gmail.com.
On Sat, 07 Mar 2009 20:03:49 -0000, Mithila Nagendra <mn...@asu.edu> wrote:

> Hey all,
> I'm using Hadoop version 0.18.3 and was wondering whether the reduce phase
> starts only after the mapping is completed. Is it required that the map
> phase be 100% done, or can it be programmed in such a way that the reduce
> starts earlier?
>
> Thanks!
> Mithila Nagendra
> Arizona State University

As I imagine it, the reduce phase starts as soon as the job starts and waits
for data from the mappers. Say you configured the system to run 2 reducers
and 5 mappers. When the job starts, the 2 reducers start as well, and each of
them waits for its share of the map output (the output is partitioned by key,
so a reducer may receive data from any of the 5 maps rather than from a fixed
subset of them). While the various mappers start and stop, the 2 reducers
stay alive and collect data from them. After the 5 mappers have "eaten" all
the input data and the reducers have processed what they collected, the
reducers terminate...
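
A small sketch of how map output could be routed in the 2-reducer, 5-mapper
example, written in plain Python rather than against the Hadoop API. The
names `partition` and `shuffle` are hypothetical, and the byte-sum hash is
only a deterministic stand-in for a real hash partitioner (Hadoop's default
partitioner uses the key's hash code modulo the number of reduce tasks). It
illustrates that the reducer for a pair is chosen by its key, so which maps a
reducer reads from depends on where its keys were emitted.

```python
def partition(key, num_reducers):
    """Pick a reducer index for a key (illustrative stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair to the reducer chosen by its key.

    map_outputs: dict of map_id -> [(key, value), ...]
    Returns (inputs, sources): per-reducer lists of pairs, and per-reducer
    sets of the map ids that contributed at least one pair.
    """
    inputs = [[] for _ in range(num_reducers)]
    sources = [set() for _ in range(num_reducers)]
    for map_id, pairs in map_outputs.items():
        for key, value in pairs:
            reducer = partition(key, num_reducers)
            inputs[reducer].append((key, value))
            sources[reducer].add(map_id)
    return inputs, sources
```

With 5 maps that each emit the keys "a" and "b", `shuffle(map_outputs, 2)`
sends every "a" to one reducer and every "b" to the other, and both reducers
end up with all 5 maps in their `sources` set.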