Posted to dev@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2016/07/02 21:24:54 UTC

Dataset and Aggregator API pain points

After working with the Dataset and Aggregator APIs for a few weeks porting
some fairly complex RDD algorithms (an overall pleasant experience), I wanted
to summarize the pain points, along with some suggestions for improvement,
based on that experience. All of these have already been mentioned on the
mailing list or in JIRA, but I figured it's good to collect them in one place.
see below.
best,
koert

*) Many practical aggregation functions do not have a zero. This can be
handled correctly by using null or None as the zero for Aggregator. In
algebird, for example, this is expressed by lifting an algebird.Aggregator
(which does not have a zero) into an algebird.MonoidAggregator (which does
have a zero, and is therefore similar to Spark's Aggregator). See:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
Something similar should be possible in Spark. However, Aggregator currently
does not accept null or an Option as its zero, which makes this approach
difficult. See:
https://www.mail-archive.com/user@spark.apache.org/msg53106.html
https://issues.apache.org/jira/browse/SPARK-15810
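The lifting idea can be sketched in plain Scala (no Spark or algebird
dependency; the names here are illustrative, not existing library API):
wrapping values in Option turns any zero-less combine function (a semigroup
operation) into one that does have a zero, namely None.

```scala
object LiftDemo {
  // Lift a zero-less combine (B, B) => B into one over Option[B],
  // where None plays the role of the missing zero.
  def lift[B](combine: (B, B) => B): (Option[B], Option[B]) => Option[B] = {
    case (None, b)          => b
    case (a, None)          => a
    case (Some(x), Some(y)) => Some(combine(x, y))
  }

  def main(args: Array[String]): Unit = {
    // max has no natural zero over all Ints, but the lifted version does:
    val liftedMax = lift[Int](math.max)
    println(List(3, 7, 2).foldLeft(Option.empty[Int])((acc, x) => liftedMax(acc, Some(x)))) // Some(7)
    println(List.empty[Int].foldLeft(Option.empty[Int])((acc, x) => liftedMax(acc, Some(x)))) // None
  }
}
```

The empty input correctly yields None rather than an arbitrary sentinel value,
which is exactly the behavior an Aggregator with an Option (or null) zero
would give.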

*) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably by
using an Aggregator (with null or None as the zero) under the hood. The
current implementation goes through flatMapGroups, which is suboptimal.
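For illustration, here is a hypothetical sketch of how reduceGroups(f) could
be backed by an Aggregator with a null zero (the class name ReduceAggregator
is assumed here, not existing Spark API; it requires a SparkSession to run):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical: an Aggregator that wraps an arbitrary reduce function,
// using null as the zero for the "no value seen yet" state.
class ReduceAggregator[T: Encoder](f: (T, T) => T) extends Aggregator[T, T, T] {
  def zero: T = null.asInstanceOf[T]
  def reduce(b: T, a: T): T = if (b == null) a else f(b, a)
  def merge(b1: T, b2: T): T =
    if (b1 == null) b2 else if (b2 == null) b1 else f(b1, b2)
  def finish(b: T): T = b
  def bufferEncoder: Encoder[T] = implicitly[Encoder[T]]
  def outputEncoder: Encoder[T] = implicitly[Encoder[T]]
}

// reduceGroups(f) could then delegate to agg(new ReduceAggregator(f).toColumn)
// and run as a partial aggregation, instead of materializing each group
// through flatMapGroups.
```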

*) KeyValueGroupedDataset needs mapValues. Without it, porting many
algorithms from RDD to Dataset is difficult and clumsy. See:
https://issues.apache.org/jira/browse/SPARK-15780
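As a concrete sketch of the clumsiness (ds is assumed here to be a
Dataset[(String, Int)]; this needs a SparkSession to run):

```scala
// With ds: Dataset[(String, Int)]:
val grouped = ds.groupByKey(_._1)  // KeyValueGroupedDataset[String, (String, Int)]

// Desired (per SPARK-15780): strip the key from the values once, up front:
//   val values = grouped.mapValues(_._2)
//   val sums = values.mapGroups { (key, ints) => (key, ints.sum) }

// Current workaround: every downstream function has to strip the key itself,
// over and over:
val sums = grouped.mapGroups { (key, rows) => (key, rows.map(_._2).sum) }
```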

*) Aggregators also need to work within DataFrames (i.e. on
RelationalGroupedDataset) without having to fall back on Row objects as
input. Otherwise all aggregation code ends up being written twice: once as
an Aggregator and once as a UserDefinedAggregateFunction (UDAF). That
doesn't make sense to me. My attempt at addressing this:
https://issues.apache.org/jira/browse/SPARK-15769
https://github.com/apache/spark/pull/13512
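To make the duplication concrete, here is a minimal typed sum written as an
Aggregator (a sketch; it requires a SparkSession to actually run):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A typed sum over Int, usable on a Dataset via .toColumn:
object SumAgg extends Aggregator[Int, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Int): Int = b + a
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(b: Int): Int = b
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// This works on a typed, grouped Dataset:
//   ds.groupByKey(_.key).agg(SumAgg.toColumn)
// but df.groupBy("key").agg(...) on a DataFrame expects a
// UserDefinedAggregateFunction, which operates on Row objects and has to be
// written as a second, separate implementation of the same logic.
```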

best,
koert

Re: Dataset and Aggregator API pain points

Posted by Reynold Xin <rx...@databricks.com>.
See https://issues.apache.org/jira/browse/SPARK-16390


Re: Dataset and Aggregator API pain points

Posted by Reynold Xin <rx...@databricks.com>.
Thanks, Koert, for the great email. They are all great points.

We should probably create an umbrella JIRA for easier tracking.
