Posted to issues@spark.apache.org by "Herman van Hovell tot Westerflier (JIRA)" <ji...@apache.org> on 2015/05/22 20:39:17 UTC

[jira] [Commented] (SPARK-4233) Simplify the Aggregation Function implementation

    [ https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556588#comment-14556588 ] 

Herman van Hovell tot Westerflier commented on SPARK-4233:
----------------------------------------------------------

Hi, 

I have looked through the code in the PR. The new interface doesn't look simpler to me. It seems to have been designed with Hive UDAFs in mind.

Can you explain to me why the current UDAF implementation is complicated, why it needs to change, and what is improved if we start to use the proposed implementation?

As for the distinct implementations: why not nest the required aggregation operator in a distinct operator? For instance:
{code}
case class DistinctifyFunction(
    @transient expr: Seq[Expression],
    @transient aggr: AggregateFunction,
    @transient base: AggregateExpression)
  extends AggregateFunction {

  def this() = this(null, null, null) // Required for serialization.

  val seen = new OpenHashSet[Any]()

  @transient
  val distinctValue = new InterpretedProjection(expr)

  override def update(input: Row): Unit = {
    val evaluatedExpr = distinctValue(input)
    if (!evaluatedExpr.anyNull) {
      seen.add(evaluatedExpr)
    }
  }

  override def eval(input: Row): Any = {
    // Assumes the input of the wrapped AggregateFunction has been rerouted to the distinct value projection.
    seen.foreach(aggr.update(_))
    aggr.eval(input)
  }
}
{code}
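
To show the nesting idea independently of Catalyst, here is a minimal, self-contained sketch. The trait SimpleAgg and the classes SumAgg and DistinctAgg are hypothetical stand-ins that I made up for illustration; they are not Spark classes, and the sketch ignores null handling, serialization and partial aggregation.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an aggregate function; NOT Spark's AggregateFunction.
trait SimpleAgg[A, R] {
  def update(value: A): Unit
  def eval(): R
}

// A plain, non-distinct sum.
final class SumAgg extends SimpleAgg[Long, Long] {
  private var total = 0L
  override def update(value: Long): Unit = total += value
  override def eval(): Long = total
}

// Wraps any aggregate: collects the distinct inputs first and only
// forwards each of them once to the wrapped aggregate.
final class DistinctAgg[A, R](inner: SimpleAgg[A, R]) extends SimpleAgg[A, R] {
  private val seen = mutable.LinkedHashSet.empty[A]

  override def update(value: A): Unit = { seen += value }

  // eval feeds the wrapped aggregate, so it should be called once per instance.
  override def eval(): R = {
    seen.foreach(inner.update)
    inner.eval()
  }
}

object DistinctAggDemo {
  // SUM(DISTINCT x) expressed as a distinct wrapper around the ordinary sum.
  def sumDistinct(values: Seq[Long]): Long = {
    val agg = new DistinctAgg(new SumAgg)
    values.foreach(agg.update)
    agg.eval()
  }

  def main(args: Array[String]): Unit = {
    println(sumDistinct(Seq(1L, 2L, 2L, 3L, 3L, 3L))) // prints 6
  }
}
```

The point of the sketch is that SumAgg needs no distinct variant of its own: the wrapper is the only place that knows about deduplication, so one wrapper could serve every aggregate.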

> Simplify the Aggregation Function implementation
> ------------------------------------------------
>
>                 Key: SPARK-4233
>                 URL: https://issues.apache.org/jira/browse/SPARK-4233
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Cheng Hao
>
> Currently, the UDAF implementation is quite complicated, and we have to provide distinct & non-distinct versions.


