You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by David Whiting <da...@gmail.com> on 2014/07/30 15:52:38 UTC

Use of Iterable with combine in Scrunch

The Scrunch version of combine accepts a function Iterable[V] => V . This
causes a lot of unexpected behaviour because the iterable that is wrapped
is actually a SingleUseIterable, and much of Scala's collection function
implementations actually try and access the underlying iterator multiple
times if they know that it's possible. This leads to often having to write
code like this:

...
.groupByKey()
.combine { _.iterator reduce { _ + _ } }

This is a silly example of course, because there's an Aggregator for
summation, but if your reduce function is more complex you have to do this
indirection via iterator in order to get correct behaviour.

Possible fixes:
a) Change combine to accept a function TraversableOnce[V] => V or
Iterator[V] => V, better reflecting the single-use nature of the underlying
Iterable
b) Given that most custom combines will in fact be folds over monoids, we
could promote the notion of reduce or fold up the the PGroupedTable itself,
so you can do .groupByKey().foldValues(_+_)

Re: Use of Iterable with combine in Scrunch

Posted by Josh Wills <jw...@cloudera.com>.
I would be in favor of both changes, if no one else minded. I love that
Scala allows us to express the TraversableOnce concept, and I'm all for
using it in this instance.

While we're on the subject of combiners, I was toying with the idea of a
GenericAggregator, which would basically encapsulate a combine + a map so
that you could do things like averages.

Oh, and slightly related, I hope you saw where I pulled in Algebird for
monoid calculations in Scrunch in
https://issues.apache.org/jira/browse/CRUNCH-424

J

J


On Wed, Jul 30, 2014 at 6:52 AM, David Whiting <da...@gmail.com>
wrote:

> The Scrunch version of combine accepts a function Iterable[V] => V . This
> causes a lot of unexpected behaviour because the iterable that is wrapped
> is actually a SingleUseIterable, and much of Scala's collection function
> implementations actually try and access the underlying iterator multiple
> times if they know that it's possible. This leads to often having to write
> code like this:
>
> ...
> .groupByKey()
> .combine { _.iterator reduce { _ + _ } }
>
> This is a silly example of course, because there's an Aggregator for
> summation, but if your reduce function is more complex you have to do this
> indirection via iterator in order to get correct behaviour.
>
> Possible fixes:
> a) Change combine to accept a function TraversableOnce[V] => V or
> Iterator[V] => V, better reflecting the single-use nature of the underlying
> Iterable
> b) Given that most custom combines will in fact be folds over monoids, we
> could promote the notion of reduce or fold up the the PGroupedTable itself,
> so you can do .groupByKey().foldValues(_+_)
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>