Posted to issues@spark.apache.org by "Mansur Ashraf (JIRA)" <ji...@apache.org> on 2016/12/05 23:54:58 UTC

[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

    [ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723786#comment-15723786 ] 

Mansur Ashraf commented on SPARK-18728:
---------------------------------------

Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have a large number of jobs on Spark 1.6 that use Algebird Aggregators through the `aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators are composable (meaning you can combine any number of aggregators into a single combined aggregator), our jobs combine 10+ aggregators and perform single-pass aggregations using `aggregateByKey`/`combineByKey`. As we upgrade to Spark 2.0.0 and the new Dataset API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset), we found that `aggregateByKey`/`combineByKey` are not available there, so we can't pass Algebird aggregators directly. Instead there is a new Aggregator API that is based on Algebird's but (as far as I can tell) does not allow joining multiple aggregators and limits the number of aggregators to 4.
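
To make "composable" concrete, here is a minimal, dependency-free Scala sketch of the idea. It is not Algebird's or Spark's actual code: `SimpleAggregator`, `AggregatorSketch`, and the local `aggregateByKey` helper are hypothetical names used only for illustration, loosely following the prepare/reduce/present shape of an aggregator. It shows how joining aggregators yields one aggregator that computes several results in a single pass per key.

```scala
// Hypothetical stand-in for a composable aggregator (not Algebird's or Spark's API).
trait SimpleAggregator[A, B, C] { self =>
  def prepare(input: A): B     // turn one input into an intermediate value
  def reduce(l: B, r: B): B    // associative merge of two intermediate values
  def present(reduced: B): C   // turn the final intermediate value into the result

  // Combine this aggregator with another so both run over the same input in one pass.
  def join[B2, C2](that: SimpleAggregator[A, B2, C2]): SimpleAggregator[A, (B, B2), (C, C2)] =
    new SimpleAggregator[A, (B, B2), (C, C2)] {
      def prepare(input: A): (B, B2) = (self.prepare(input), that.prepare(input))
      def reduce(l: (B, B2), r: (B, B2)): (B, B2) =
        (self.reduce(l._1, r._1), that.reduce(l._2, r._2))
      def present(reduced: (B, B2)): (C, C2) =
        (self.present(reduced._1), that.present(reduced._2))
    }
}

object AggregatorSketch {
  val sum: SimpleAggregator[Int, Long, Long] = new SimpleAggregator[Int, Long, Long] {
    def prepare(input: Int): Long = input.toLong
    def reduce(l: Long, r: Long): Long = l + r
    def present(reduced: Long): Long = reduced
  }

  val max: SimpleAggregator[Int, Int, Int] = new SimpleAggregator[Int, Int, Int] {
    def prepare(input: Int): Int = input
    def reduce(l: Int, r: Int): Int = math.max(l, r)
    def present(reduced: Int): Int = reduced
  }

  val count: SimpleAggregator[Int, Long, Long] = new SimpleAggregator[Int, Long, Long] {
    def prepare(input: Int): Long = 1L
    def reduce(l: Long, r: Long): Long = l + r
    def present(reduced: Long): Long = reduced
  }

  // Local, in-memory stand-in for a per-key aggregation; a real job would run
  // the same prepare/reduce/present steps on an RDD instead of a Seq.
  def aggregateByKey[K, A, B, C](data: Seq[(K, A)], agg: SimpleAggregator[A, B, C]): Map[K, C] =
    data.foldLeft(Map.empty[K, B]) { case (acc, (k, a)) =>
      val b = agg.prepare(a)
      acc.updated(k, acc.get(k).map(agg.reduce(_, b)).getOrElse(b))
    }.map { case (k, b) => k -> agg.present(b) }

  // Three aggregators joined into one; each input record is visited once.
  val combined = sum.join(max).join(count)
  val data: Seq[(String, Int)] = Seq("a" -> 1, "a" -> 5, "b" -> 2)
  val result: Map[String, ((Long, Int), Long)] = aggregateByKey(data, combined)

  def main(args: Array[String]): Unit = println(result)
}
```

The nested tuple result grows with every `join`, which is why a library-level combinator for many aggregators matters once 10+ of them are combined.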

It would be really nice if Spark used Algebird aggregators instead of creating its own, or allowed users to pass Algebird aggregators to the Dataset API in addition to Spark aggregators.

Thanks

> Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18728
>                 URL: https://issues.apache.org/jira/browse/SPARK-18728
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Alex Levenson
>            Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: https://github.com/twitter/algebird"
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
