You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Mathias Herberts <ma...@gmail.com> on 2011/10/29 13:35:22 UTC

Grouping in Combiners

Another point concerning the Combiners,

the grouping is currently done using the RawComparator used for
sorting the Mapper's output. Wouldn't it be useful to be able to set a
custom CombinerGroupingComparatorClass?

Mathias.

Re: Grouping in Combiners

Posted by Shevek <sh...@karmasphere.com>.
On 31 October 2011 14:13, Mathias Herberts <ma...@gmail.com>wrote:

> Not one I can think of at the moment, but intuitively that's the kind of
> flexibility I would like to see should a grouping comparator become
> configurable for Combiners.
>

I don't think there's a good example, because contractually, IF the
combiner uses a different grouping comparator, then either

a) It has to be a sub-comparator of the reduce grouping comparator, in
which case the reduction is nonoptimal - there are algorithmic arguments
for doing this, but they can all just as easily be resolved using the type
system as messing with the structure of the dataflow.

b) It's a super-comparator (i.e. compares fewer fields and generates larger
groups) of the reduce grouping comparator, in which case it has memory
visible effects on the reducer since groups have now vanished which should
have been present without the "optimization".

One case is suboptimal, the other case is incorrect.


> By pushing your reasoning further, why specify a combiner class at all
> instead of applying the Reducer on the map side.
>

I'll give a concrete example to explain why one doesn't just use a reducer
on the map side: The type contract is different.

Average, as a PartialAggregate: Int -> Partial -> Double

Mapper converts Int to a Partial { value : 1 }

Combiner converts a set of Partials (and possibly, but unlikely Ints) to a
Partial { sum(values) : sum(counts) }

Reducer converts a set of Partials to a Double: sum(values) / sum(counts)

The combiner is idempotent in types, the reducer is not. But the output
from the combiner is not the desired result.

Obviously you can build a reducer out of the Combiner plus some extra code,
but I think that fact is a distraction from the question. Alternatively,
you could do it the other way around and say "Run the combiner on the
reduce side, and force the reducer to accept a single Partial as input, not
a list of Partials." but again I think one can fake up an algorithm where
the reducer needs multiple inputs which cannot easily be merged into a
single type by the combiner (Well, I think given a complex enough combiner
type, this could be wrong... I'll think about it.)

S.


> On Oct 31, 2011 10:02 PM, "Shevek" <sh...@karmasphere.com> wrote:
>
> > On 31 October 2011 13:37, Mathias Herberts <mathias.herberts@gmail.com
> > >wrote:
> >
> > > I don't know if it's a bug but I'd rather have the ability to set a
> > > Combiner specific group comparator than to have the Combiner use the
> > group
> > > comparator set for the Reducer.
> > > On Oct 31, 2011 9:21 PM, "Harsh J" <ha...@cloudera.com> wrote:
> > >
> >
> > Now I'm curious. Can you argue that there's a case where it makes a
> > difference? Preferably one where it can't be trivially curried into the
> > combiner?
> >
> > S.
> >
> >
> > > > Shevek,
> > > >
> > > > The problem Mathias indicates here is that the Combiners do not
> utilize
> > > > the Grouping Comparators. They only use the Sort Comparators. Is that
> > > > probably a bug is what I wonder.
> > > >
> > > > On 31-Oct-2011, at 11:14 PM, Shevek wrote:
> > > >
> > > > > I like the ability to reuse a Java component for both sorting and
> > > > grouping,
> > > > > and to be honest, since the cases where one can do a comparison
> > without
> > > > > deserializing the raw bytes are relatively few and far between, I
> > tend
> > > to
> > > > > use java's Comparator interface, and wrap it in some
> > > > > infrastructure-specific adapter. I have a vague feeling that Hadoop
> > > > > sometimes calls the byte interface and sometimes the object
> interface
> > > > > anyway? ICBW, the way I've been writing code makes it irrelevant.
> > > > >
> > > > > Alternatively, I've misunderstood the (simpler) question, and the
> > > answer
> > > > is
> > > > > to use the setGroupingComparatorClass() API.
> > > > >
> > > > > S.
> > > > >
> > > > > On 29 October 2011 04:35, Mathias Herberts <
> > mathias.herberts@gmail.com
> > > > >wrote:
> > > > >
> > > > >> Another point concerning the Combiners,
> > > > >>
> > > > >> the grouping is currently done using the RawComparator used for
> > > > >> sorting the Mapper's output. Wouldn't it be useful to be able to
> > set a
> > > > >> custom CombinerGroupingComparatorClass?
> > > > >>
> > > > >> Mathias.
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: Grouping in Combiners

Posted by Mathias Herberts <ma...@gmail.com>.
Not one I can think of at the moment, but intuitively that's the kind of
flexibility I would like to see should a grouping comparator become
configurable for Combiners.

By pushing your reasoning further, why specify a combiner class at all
instead of applying the Reducer on the map side.
On Oct 31, 2011 10:02 PM, "Shevek" <sh...@karmasphere.com> wrote:

> On 31 October 2011 13:37, Mathias Herberts <mathias.herberts@gmail.com
> >wrote:
>
> > I don't know if it's a bug but I'd rather have the ability to set a
> > Combiner specific group comparator than to have the Combiner use the
> group
> > comparator set for the Reducer.
> > On Oct 31, 2011 9:21 PM, "Harsh J" <ha...@cloudera.com> wrote:
> >
>
> Now I'm curious. Can you argue that there's a case where it makes a
> difference? Preferably one where it can't be trivially curried into the
> combiner?
>
> S.
>
>
> > > Shevek,
> > >
> > > The problem Mathias indicates here is that the Combiners do not utilize
> > > the Grouping Comparators. They only use the Sort Comparators. Is that
> > > probably a bug is what I wonder.
> > >
> > > On 31-Oct-2011, at 11:14 PM, Shevek wrote:
> > >
> > > > I like the ability to reuse a Java component for both sorting and
> > > grouping,
> > > > and to be honest, since the cases where one can do a comparison
> without
> > > > deserializing the raw bytes are relatively few and far between, I
> tend
> > to
> > > > use java's Comparator interface, and wrap it in some
> > > > infrastructure-specific adapter. I have a vague feeling that Hadoop
> > > > sometimes calls the byte interface and sometimes the object interface
> > > > anyway? ICBW, the way I've been writing code makes it irrelevant.
> > > >
> > > > Alternatively, I've misunderstood the (simpler) question, and the
> > answer
> > > is
> > > > to use the setGroupingComparatorClass() API.
> > > >
> > > > S.
> > > >
> > > > On 29 October 2011 04:35, Mathias Herberts <
> mathias.herberts@gmail.com
> > > >wrote:
> > > >
> > > >> Another point concerning the Combiners,
> > > >>
> > > >> the grouping is currently done using the RawComparator used for
> > > >> sorting the Mapper's output. Wouldn't it be useful to be able to
> set a
> > > >> custom CombinerGroupingComparatorClass?
> > > >>
> > > >> Mathias.
> > > >>
> > >
> > >
> >
>

Re: Grouping in Combiners

Posted by Shevek <sh...@karmasphere.com>.
On 31 October 2011 13:37, Mathias Herberts <ma...@gmail.com>wrote:

> I don't know if it's a bug but I'd rather have the ability to set a
> Combiner specific group comparator than to have the Combiner use the group
> comparator set for the Reducer.
> On Oct 31, 2011 9:21 PM, "Harsh J" <ha...@cloudera.com> wrote:
>

Now I'm curious. Can you argue that there's a case where it makes a
difference? Preferably one where it can't be trivially curried into the
combiner?

S.


> > Shevek,
> >
> > The problem Mathias indicates here is that the Combiners do not utilize
> > the Grouping Comparators. They only use the Sort Comparators. Is that
> > probably a bug is what I wonder.
> >
> > On 31-Oct-2011, at 11:14 PM, Shevek wrote:
> >
> > > I like the ability to reuse a Java component for both sorting and
> > grouping,
> > > and to be honest, since the cases where one can do a comparison without
> > > deserializing the raw bytes are relatively few and far between, I tend
> to
> > > use java's Comparator interface, and wrap it in some
> > > infrastructure-specific adapter. I have a vague feeling that Hadoop
> > > sometimes calls the byte interface and sometimes the object interface
> > > anyway? ICBW, the way I've been writing code makes it irrelevant.
> > >
> > > Alternatively, I've misunderstood the (simpler) question, and the
> answer
> > is
> > > to use the setGroupingComparatorClass() API.
> > >
> > > S.
> > >
> > > On 29 October 2011 04:35, Mathias Herberts <mathias.herberts@gmail.com
> > >wrote:
> > >
> > >> Another point concerning the Combiners,
> > >>
> > >> the grouping is currently done using the RawComparator used for
> > >> sorting the Mapper's output. Wouldn't it be useful to be able to set a
> > >> custom CombinerGroupingComparatorClass?
> > >>
> > >> Mathias.
> > >>
> >
> >
>

Re: Grouping in Combiners

Posted by Mathias Herberts <ma...@gmail.com>.
I don't know if it's a bug but I'd rather have the ability to set a
Combiner specific group comparator than to have the Combiner use the group
comparator set for the Reducer.
On Oct 31, 2011 9:21 PM, "Harsh J" <ha...@cloudera.com> wrote:

> Shevek,
>
> The problem Mathias indicates here is that the Combiners do not utilize
> the Grouping Comparators. They only use the Sort Comparators. Is that
> probably a bug is what I wonder.
>
> On 31-Oct-2011, at 11:14 PM, Shevek wrote:
>
> > I like the ability to reuse a Java component for both sorting and
> grouping,
> > and to be honest, since the cases where one can do a comparison without
> > deserializing the raw bytes are relatively few and far between, I tend to
> > use java's Comparator interface, and wrap it in some
> > infrastructure-specific adapter. I have a vague feeling that Hadoop
> > sometimes calls the byte interface and sometimes the object interface
> > anyway? ICBW, the way I've been writing code makes it irrelevant.
> >
> > Alternatively, I've misunderstood the (simpler) question, and the answer
> is
> > to use the setGroupingComparatorClass() API.
> >
> > S.
> >
> > On 29 October 2011 04:35, Mathias Herberts <mathias.herberts@gmail.com
> >wrote:
> >
> >> Another point concerning the Combiners,
> >>
> >> the grouping is currently done using the RawComparator used for
> >> sorting the Mapper's output. Wouldn't it be useful to be able to set a
> >> custom CombinerGroupingComparatorClass?
> >>
> >> Mathias.
> >>
>
>

Re: Grouping in Combiners

Posted by Shevek <sh...@karmasphere.com>.
On 31 October 2011 13:20, Harsh J <ha...@cloudera.com> wrote:

> Shevek,
>
> The problem Mathias indicates here is that the Combiners do not utilize
> the Grouping Comparators. They only use the Sort Comparators. Is that
> probably a bug is what I wonder.
>


Hm, that's a rather interesting definitional question.

The pure structure is: ... -> scatter -> gather -> ...

For any sort of structural messing around, we're allowed to do anything we
like with the scatter-gather pair, as long as there are no visible effects
on the input or output. So...

For clusters, it becomes ... -> scatter -> shuffle -> gather -> ... because
"shuffle -> gather" is a distributed gather. Sorting is a secondary issue
which only becomes visible to the reducer after the gather. Google modified
their mapreduce to make sorting optional, IIRC, made Tenzing a lot faster,
so let's do that:

... -> scatter -> gather [-> sort-within-group] -> ...

Square brackets are optional, as usual. So, the use of the sorting
comparator for the scatter-gather is a mathematical oddity. We are now free
to construct a gather using any mechanism which produces groups, e.g. bins,
etc. (Note general case that arbitrary values can't be gathered faster than
sorting still applies.)

The combiner is inserted between scatter and shuffle. If we construct:

... -> scatter -> partial-gather [[-> sort-within-group] -> combine] ->
scatter[-groups] -> gather[-groups] [-> sort-within-groups] -> ...

then perhaps the use of the sorting comparator for combiners really is a
bug. Note the extra sort-within-group is optional in case the combiner
needs the sort.

If we use the above structural layout of mapreduce, we can also remove the
restriction that the sorting and grouping comparators must agree on the
subset of the key on which they compare, which I _think_ is present
(correct me as usual, please).

Perhaps we should pin a copy of this diatribe on the back of the toilet
door in case anyone else knows the answer, but in conclusion, I think it
_is_ a bug.

S.


> On 31-Oct-2011, at 11:14 PM, Shevek wrote:
>
> > I like the ability to reuse a Java component for both sorting and
> grouping,
> > and to be honest, since the cases where one can do a comparison without
> > deserializing the raw bytes are relatively few and far between, I tend to
> > use java's Comparator interface, and wrap it in some
> > infrastructure-specific adapter. I have a vague feeling that Hadoop
> > sometimes calls the byte interface and sometimes the object interface
> > anyway? ICBW, the way I've been writing code makes it irrelevant.
> >
> > Alternatively, I've misunderstood the (simpler) question, and the answer
> is
> > to use the setGroupingComparatorClass() API.
> >
> > S.
> >
> > On 29 October 2011 04:35, Mathias Herberts <mathias.herberts@gmail.com
> >wrote:
> >
> >> Another point concerning the Combiners,
> >>
> >> the grouping is currently done using the RawComparator used for
> >> sorting the Mapper's output. Wouldn't it be useful to be able to set a
> >> custom CombinerGroupingComparatorClass?
> >>
> >> Mathias.
> >>
>
>

Re: Grouping in Combiners

Posted by Harsh J <ha...@cloudera.com>.
Shevek,

The problem Mathias indicates here is that the Combiners do not utilize the Grouping Comparators. They only use the Sort Comparators. Is that probably a bug is what I wonder.

On 31-Oct-2011, at 11:14 PM, Shevek wrote:

> I like the ability to reuse a Java component for both sorting and grouping,
> and to be honest, since the cases where one can do a comparison without
> deserializing the raw bytes are relatively few and far between, I tend to
> use java's Comparator interface, and wrap it in some
> infrastructure-specific adapter. I have a vague feeling that Hadoop
> sometimes calls the byte interface and sometimes the object interface
> anyway? ICBW, the way I've been writing code makes it irrelevant.
> 
> Alternatively, I've misunderstood the (simpler) question, and the answer is
> to use the setGroupingComparatorClass() API.
> 
> S.
> 
> On 29 October 2011 04:35, Mathias Herberts <ma...@gmail.com>wrote:
> 
>> Another point concerning the Combiners,
>> 
>> the grouping is currently done using the RawComparator used for
>> sorting the Mapper's output. Wouldn't it be useful to be able to set a
>> custom CombinerGroupingComparatorClass?
>> 
>> Mathias.
>> 


Re: Grouping in Combiners

Posted by Shevek <sh...@karmasphere.com>.
I like the ability to reuse a Java component for both sorting and grouping,
and to be honest, since the cases where one can do a comparison without
deserializing the raw bytes are relatively few and far between, I tend to
use java's Comparator interface, and wrap it in some
infrastructure-specific adapter. I have a vague feeling that Hadoop
sometimes calls the byte interface and sometimes the object interface
anyway? ICBW, the way I've been writing code makes it irrelevant.

Alternatively, I've misunderstood the (simpler) question, and the answer is
to use the setGroupingComparatorClass() API.

S.

On 29 October 2011 04:35, Mathias Herberts <ma...@gmail.com>wrote:

> Another point concerning the Combiners,
>
> the grouping is currently done using the RawComparator used for
> sorting the Mapper's output. Wouldn't it be useful to be able to set a
> custom CombinerGroupingComparatorClass?
>
> Mathias.
>