Posted to mapreduce-user@hadoop.apache.org by Arun C Murthy <ac...@yahoo-inc.com> on 2011/01/04 09:07:48 UTC

Re: When a Reduce Task starts?

On Dec 23, 2010, at 9:20 PM, pig wrote:
> For some special reduce jobs that do not rely on the order of
> (key,value) pairs, the sort phase is of no use.
> In this situation, theoretically speaking, the reduce could be started
> before all of the map tasks have finished.
> But why doesn't Hadoop support this feature? For example, it could be
> specified as an argument when submitting a job.
>

Several reasons...

A major problem is errors: a map may fail after its output has been
'shuffled' by some reduces but not all (i.e. copied by only some
reduces). When that map is re-executed, it's really hard to track and
discard the duplicate key/value pairs at the reduces that already
copied the earlier attempt's output.
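To make the failure case concrete, here is a small stand-alone simulation (not Hadoop code; the function and record names are made up for illustration). It models one map task whose node fails after a reduce has already copied its output: the re-executed attempt emits the same records again, and without per-attempt bookkeeping the reduce double-counts.

```python
# Hypothetical sketch: why merging map output before all maps finish
# makes failure recovery hard. A map attempt is re-run after its
# output was already fetched, so the same records arrive twice.
from collections import Counter

def shuffled_count(track_attempt_ids):
    # Records fetched by a reduce: (map_id, attempt_id, record).
    fetched = []
    # Map "m1", attempt 0, emits two records; the reduce copies them,
    # then the map's node dies and the output is lost.
    fetched.extend([("m1", 0, ("word", 1)), ("m1", 0, ("word", 1))])
    # m1 is re-executed as attempt 1 and re-emits the same records.
    fetched.extend([("m1", 1, ("word", 1)), ("m1", 1, ("word", 1))])
    if track_attempt_ids:
        # The bookkeeping a framework would need: keep only the
        # latest attempt's records for each map task.
        latest = max(a for (m, a, _) in fetched if m == "m1")
        fetched = [(m, a, r) for (m, a, r) in fetched if a == latest]
    counts = Counter(r for (_, _, r) in fetched)
    return counts[("word", 1)]

print(shuffled_count(track_attempt_ids=False))  # 4: double-counted
print(shuffled_count(track_attempt_ids=True))   # 2: correct
```

Hadoop sidesteps this entirely by letting reduces copy only from maps that have completed successfully, which is why the sort/merge cannot begin on partial input.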

The behaviour you seek is quite easy to model by running a map-only
job, saving its output to HDFS, and processing it in the next job -
albeit with some performance penalty. But this keeps the MR framework
very simple and stable.
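A map-only job like the one described above can be configured by setting the number of reduce tasks to zero, which skips the shuffle and sort entirely and writes map output straight to HDFS. A minimal sketch using Hadoop's MapReduce Job API (the `MyMapper` class and the input/output paths are placeholders, not from the original message):

```java
// Sketch of a map-only job: with zero reduces there is no shuffle or
// sort, and each mapper's output is written directly to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-pass");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(MyMapper.class);  // MyMapper is hypothetical
    job.setNumReduceTasks(0);            // map-only: no shuffle, no sort
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The second job in the chain then reads the first job's output directory as its input, paying the extra HDFS round trip mentioned above.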

Arun