You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Harish Mallipeddi <ha...@gmail.com> on 2009/09/12 08:11:57 UTC

What's a valid combiner?

Hi,
I was looking at how the Combiner gets run inside Hadoop. Since there's no
separate Combiner interface, it basically seems to give the user the
impression any valid Reducer can be used as a Combiner. There's a little
"warning" note added to the docs under the setCombinerClass() method as to
what a combiner should not do:

"The framework may invoke the combiner 0, 1, or multiple times, in both the
mapper and reducer tasks. In general, the combiner is called as the
sort/merge result is written to disk. The combiner must:
* be side-effect free
* have the same input and output key types and the same input and output
value types
Typically the combiner is same as the Reducer for the job i.e.
setReducerClass(Class)."

But after reading through some of the Hadoop code, it seems like there are
many more restrictions on what a valid combiner is. Correct me if I'm wrong
but a combiner can potentially get run 3 different times:

1) Map-side - when a map-side spill is about to occur, combiner is run
inside each partition.
2) Map-side - when multiple spill files are being merged together.
3) Reduce-side - when multiple map-outputs are being merged before the
reduce() function is invoked.

In all these above cases, Hadoop is making an assumption that:

1) A combiner does not change the key. Apart from the fact that the input
and output key types should match (as already mentioned in the Java docs
above), it seems like the input and the output key should be exactly the
same. Because if the combiner modifies the key, the key basically ends up in
the wrong partition (because this new key is never rehashed to another
partition) and the entire merge-sort operation is affected downstream.

2) Since the combiner can be run multiple times or not be run at all, it
directly implies that the operation that is used to combine multiple values
into one value should be "associative & commutative"?

Why isn't there a separate Combiner interface? If not, maybe some of these
other gotchas should be documented somewhere?

I ran a simple experiment - I took the wordcount example and just modified
the Reducer such that it outputs the string key "reversed" and then used
this reducer as the combiner. It led to some funny results.

-- 
Harish Mallipeddi
http://blog.poundbang.in

Re: What's a valid combiner?

Posted by Owen O'Malley <om...@apache.org>.
On Sep 11, 2009, at 11:11 PM, Harish Mallipeddi wrote:

> 1) A combiner does not change the key.

Please file a jira asking for this to be documented. You are right,  
this is a requirement on combiners and it is not currently stated.

> 2) Since the combiner can be run multiple times or not be run at  
> all, it
> directly implies that the operation that is used to combine multiple  
> values
> into one value should be "associative & commutative"?

Clearly that is the intent. However "associative and commutative" only  
have meaning for us math geeks. In non-math terms, telling the user  
that it may be run an indeterminate number of times implies that it  
must have those properties in order to do anything reasonable.

> Why isn't there a separate Combiner interface? If not, maybe some of  
> these
> other gotchas should be documented somewhere?

Such a class wouldn't be very interesting:

class Combiner extends Reducer {
}

isn't very interesting. Furthermore, it is often the case that the  
Combiner is the same as the Reducer....

I did toy with the idea of making an attribute such as @SideEffectFree  
that could mark classes that are acceptable as combiners. There is an  
old jira about this somewhere...

> I ran a simple experiment - I took the wordcount example and just  
> modified
> the Reducer such that it outputs the string key "reversed" and then  
> used
> this reducer as the combiner. It led to some funny results.

No doubt.

-- Owen