You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-user@hadoop.apache.org by Mathias Herberts <ma...@gmail.com> on 2011/10/29 12:52:45 UTC

Combiners

Hi,

I'm designing a 'Hadoop MapReduce Poster', putting all pieces together
so people will easily be able to visualize the full M/R flow.

Concerning the combiners, I have a few points I'd like to have clarified.

If I'm not mistaken, the output of the Mapper is passed to the
Partitioner which will dispatch K,V into R partitions.

<K,V> for each partition then go through the set SortComparatorClass
for sorting.

If there is a combiner, the sorted output is grouped using the
SortComparatorClass (and not the GroupingComparatorClass as it's the
case in the Reducer) and passed to the combiner prior to be written to
the partition file.

My question is, what happens if the combiner outputs different keys
than what it is being fed? The output of the combiner will suffer two
flaws:

1. It won't be sorted
2. It might end up in the wrong partition

Since a Combiner is simply a Reducer with no other constraints,
nothing seems to prevent those 2 problems.

Is my understanding correct?

Mathias.

Re: Combiners

Posted by Shevek <sh...@karmasphere.com>.

On 30 October 2011 20:22, Owen O'Malley <ow...@hortonworks.com> wrote:

> On Sat, Oct 29, 2011 at 3:52 AM, Mathias Herberts <
> mathias.herberts@gmail.com> wrote:
>
> > My question is, what happens if the combiner outputs different keys
> > than what it is being fed? The output of the combiner will suffer two
> > flaws:
> >
> > 1. It won't be sorted
> > 2. It might end up in the wrong partition
> >
>
> Yes. We've talked about adding various checks, but I don't think anyone has
> added them. We obviously have the input key and one option would be to
> ignore the output key.
>
>
> > Since a Combiner is simply a Reducer with no other constraints,
> >
>
> That isn't true. Combiners are required to be:
>  1. Idempotent - The number of times the combiner is applied can't change
> the output
>  2. Transititive -  The order of the inputs can't change the output
>  3. Side-effect free - Combiners can't have side effects (or they won't be
> idempotent).
>  4. Preserve the sort order - They can't change the keys to disrupt the
> sort order
>  5. Preserve the partitioning - They can't change the keys to change the
> parititioning
>
> All 5 of them are required for combiners.
>

I can't add to Owen's reply about Hadoop, other than perhaps to suggest
"associative" might be the word?

However, I also mess with these things at the theoretical level for my own
entertainment. This same question caught me out at first, because
mathematically speaking, the ability to change the keys in the combiner
does not prevent an efficient implementation from being constructed, but
would make the output buffer management somewhat more complex. So if the
restriction seems nonobvious, that's why.

Also, if you're explaining pure mapreduce, it also works without sorting
(assuming the user application doesn't require sorted output), all you need
is a gather operator, but (skipping tricks like bucket sort) since the most
efficient way to gather is to use a comparison-sort, one might as well.

Back to the grindstone.

S.

Re: Combiners

Posted by Owen O'Malley <ow...@hortonworks.com>.

On Mon, Oct 31, 2011 at 5:41 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

> Thanks for listing the 5 requirements, if you don't mind I'll add them
> to the Hadoop MapReducer Poster.
>

Sure.

Re: Combiners

Posted by Mathias Herberts <ma...@gmail.com>.

> Yes. We've talked about adding various checks, but I don't think anyone has
> added them. We obviously have the input key and one option would be to
> ignore the output key.

ok.

>
>> Since a Combiner is simply a Reducer with no other constraints,
>>
>
> That isn't true. Combiners are required to be:

When I meant 'Reducer with no other constraints' I meant
Job.setCombinerClass takes a Class<? extends Reducer>, not a Combiner
class which enforces stricter constraints.

Thanks for listing the 5 requirements, if you don't mind I'll add them
to the Hadoop MapReducer Poster.

Re: Combiners

Posted by Owen O'Malley <ow...@hortonworks.com>.

On Sat, Oct 29, 2011 at 3:52 AM, Mathias Herberts <
mathias.herberts@gmail.com> wrote:

> My question is, what happens if the combiner outputs different keys
> than what it is being fed? The output of the combiner will suffer two
> flaws:
>
> 1. It won't be sorted
> 2. It might end up in the wrong partition
>

Yes. We've talked about adding various checks, but I don't think anyone has
added them. We obviously have the input key and one option would be to
ignore the output key.

> Since a Combiner is simply a Reducer with no other constraints,
>

That isn't true. Combiners are required to be:
  1. Idempotent - The number of times the combiner is applied can't change
the output
  2. Transititive -  The order of the inputs can't change the output
  3. Side-effect free - Combiners can't have side effects (or they won't be
idempotent).
  4. Preserve the sort order - They can't change the keys to disrupt the
sort order
  5. Preserve the partitioning - They can't change the keys to change the
parititioning

All 5 of them are required for combiners.

-- Owen