You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean Zhong (JIRA)" <ji...@apache.org> on 2016/08/22 15:20:21 UTC

[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

Sean Zhong created SPARK-17187:
----------------------------------

             Summary: Support using arbitrary Java object as internal aggregation buffer object
                 Key: SPARK-17187
                 URL: https://issues.apache.org/jira/browse/SPARK-17187
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Sean Zhong


*Background*

For aggregation functions like sum and count, Spark-Sql internally use an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of aggregation buffer.

*Problem*

Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types stored in aggregation buffer, which is not very convenient or performant, there are several typical cases like:

1. If the aggregation has a complex model CountMinSketch, it is not very easy to convert the complex model so that it can be stored with limited Spark-sql supported data types.
2. It is hard to reuse aggregation class definition defined in existing libraries like algebird.
3. It may introduces heavy serialization/deserialization cost when converting a domain model to Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge.

*Proposal*

We propose:
1. Introduces a TypedImperativeAggregate which allows using arbitrary java object as aggregation buffer, with requirements like:
    a. It is flexible enough that the API allows using any java object as aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird.
    b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance.
2.  Refactors `TypedAggregateExpression` to use this new interface, to get higher performance.
3. Implements Appro-Percentile and other aggregation functions which has a complex aggregation object with this new interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org