Posted to mapreduce-user@hadoop.apache.org by Arun C Murthy <ac...@hortonworks.com> on 2011/07/10 22:37:34 UTC

Re: About the combiner execution

(Moving to mapreduce-user@, bcc hdfs-user@. Please use appropriate project lists - thanks)

On Jul 10, 2011, at 4:42 AM, Florin P wrote:

> Hello!
>  I've read on http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (cite):
> "The execution of combiner is not guaranteed, Hadoop may or may not execute a combiner. Also, if required it may execute it more then 1 times. Therefore your MapReduce jobs should not depend on the combiners execution. "
> Is it true? 

Right. The way to visualize it is this: the MR framework in the map task collects the 'raw' (i.e. serialized) map-output key-values in the 'sort' buffer. When the buffer is full it runs the combiner (if one is configured) and then spills the output to disk; this includes the last (final) spill. The combiner is also run when multiple spills need to be merged from disk.

However, combiner execution also depends on having a sufficient number of records to combine - this is because running the combiner is somewhat expensive, since it requires an extra serialize-deserialize pass.
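
For reference, here is a small sketch of the 0.20/1.x-era configuration knobs involved (property names and the values shown - which I believe are the defaults - are assumptions about that era of Hadoop, not a recommendation). In particular, the merge-time combiner run is gated by a minimum number of spills:

    // Sketch: map-side buffer/spill knobs that influence when the combiner runs.
    // Property names are the 0.20/1.x ones; values shown are believed to be the defaults.
    // (Configuration is org.apache.hadoop.conf.Configuration.)
    Configuration conf = new Configuration();
    conf.set("io.sort.mb", "100");                  // in-memory sort buffer size, in MB
    conf.set("io.sort.spill.percent", "0.80");      // buffer fill fraction that triggers a spill
    conf.setInt("min.num.spills.for.combine", 3);   // merge-time combiner only runs with at least this many spills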

Thus, the combiner may be run 0 or more times.
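
To make that concrete, here is a minimal word-count-style sketch (the class name and the driver wiring are placeholders, not from any particular job) where the same sum reducer is also wired in as the combiner. Because summing is associative and commutative, the result is identical whether the combiner runs zero, one, or many times:

    // Sketch only: a sum reducer that is safe to reuse as a combiner, because
    // partial sums of partial sums equal the full sum.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

    // In the (hypothetical) driver:
    //   job.setReducerClass(SumReducer.class);
    //   job.setCombinerClass(SumReducer.class);  // an optimization hint, never a correctness requirement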

> Also, is it possible to use the Combiner without the Reducer? Will the framework take the Combiner into consideration in this case?


No. When the job has no reduces (i.e. zero reduce tasks), the map outputs are written straight to HDFS (typically) without sorting them. Thus, combiners are never in that execution path.
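
For completeness, a minimal sketch of that map-only case (the class names and output path are placeholders, and this uses the 0.20/1.x-era "new Job(conf, name)" constructor):

    // Sketch: a map-only job. With zero reduce tasks there is no sort/spill/merge
    // phase, so a combiner - even if configured via setCombinerClass() - is never invoked.
    // (Imports from org.apache.hadoop.conf, org.apache.hadoop.fs, org.apache.hadoop.mapreduce,
    // and org.apache.hadoop.mapreduce.lib.output are assumed.)
    Configuration conf = new Configuration();
    Job job = new Job(conf, "map-only example");
    job.setJarByClass(MapOnlyDriver.class);       // hypothetical driver class
    job.setMapperClass(MyMapper.class);           // hypothetical mapper
    job.setNumReduceTasks(0);                     // map output is written directly by the OutputFormat
    FileOutputFormat.setOutputPath(job, new Path("/tmp/map-only-out"));  // illustrative path
    System.exit(job.waitForCompletion(true) ? 0 : 1);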

hth,
Arun