Posted to reviews@spark.apache.org by marmbrus <gi...@git.apache.org> on 2015/11/09 04:06:46 UTC

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/9555

    [SPARK-11578] [SQL] User API for Typed Aggregation

    This PR adds a new interface for user-defined aggregations that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value.
    
    For example, the following aggregator extracts an `Int` from each instance of a specific class and adds them up:
    
    ```scala
      case class Data(i: Int)
    
      val customSummer =  new Aggregator[Data, Int, Int] {
        def prepare(d: Data) = d.i
        def reduce(l: Int, r: Int) = l + r
        def present(r: Int) = r
      }.toColumn()
    
      val ds: Dataset[Data] = ...
      val aggregated = ds.select(customSummer)
    ```
    
    By using helper functions, users can make a generic `Aggregator` that works on any input type:
    
    ```scala
    /** An `Aggregator` that adds up any numeric type returned by the given function. */
    class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
      val numeric = implicitly[Numeric[N]]
      override def prepare(input: I): N = if (input == null) numeric.zero else f(input)
      override def reduce(l: N, r: N): N = numeric.plus(l, r)
      override def present(reduction: N): N = reduction
    }
    
    def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn
    ```
    
    These aggregators can then be used alongside other built-in SQL aggregations.
    
    ```scala
    val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
    ds
      .groupBy(_._1)
      .agg(
        sum(_._2),                // The aggregator defined above.
        expr("sum(_2)").as[Int],  // A built-in dynamically typed aggregation.
        count("*"))               // A built-in statically typed aggregation.
      .collect()
    
    
    res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L)
    ```
    
    The current implementation focuses on integrating this into the typed API, but it only supports aggregations that return a single long value, as explained in `TypedAggregateExpression`.  This will be improved in a follow-up PR.
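    
    As a rough illustration of the contract (not code from this PR), the three methods can be exercised in plain Scala without Spark. `SimpleAggregator` and `runOnGroup` are hypothetical names introduced for this sketch, and the driver assumes a non-empty group:
    
    ```scala
    object AggregatorContractSketch {
      // Hypothetical stand-in for the PR's Aggregator[A, B, C]; not the Spark class.
      abstract class SimpleAggregator[-A, B, C] {
        def prepare(input: A): B      // map one input to an intermediate value
        def reduce(l: B, r: B): B     // combine two intermediate values
        def present(reduction: B): C  // turn the final intermediate value into the result
      }
    
      // Hypothetical driver: applies the three steps to one non-empty group.
      def runOnGroup[A, B, C](agg: SimpleAggregator[A, B, C], group: Seq[A]): C = {
        val prepared = group.map(a => agg.prepare(a))
        val reduced = prepared.tail.foldLeft(prepared.head)((l, r) => agg.reduce(l, r))
        agg.present(reduced)
      }
    
      case class Data(i: Int)
    
      // Same shape as the customSummer example above, minus toColumn.
      val customSummer: SimpleAggregator[Data, Int, Int] =
        new SimpleAggregator[Data, Int, Int] {
          def prepare(d: Data): Int = d.i
          def reduce(l: Int, r: Int): Int = l + r
          def present(r: Int): Int = r
        }
    
      val total: Int = runOnGroup(customSummer, Seq(Data(1), Data(2), Data(3)))
    }
    ```
    
    Here `AggregatorContractSketch.total` evaluates to 6: each `Data` is mapped to its `Int`, the values are reduced pairwise, and the result is presented unchanged.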

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark dataset-useragg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9555
    
----
commit e76f4c50d5c45127f64d10f9895fb190bbc4167c
Author: Michael Armbrust <mi...@databricks.com>
Date:   2015-11-09T02:53:50Z

    [SPARK-11578] [SQL] User API for Typed Aggregation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44240372
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala ---
    @@ -0,0 +1,31 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.expressions.Aggregator
    +
    +/** An `Aggregator` that adds up any numeric type returned by the given function. */
    +class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
    --- End diff --
    
    In this case, why don't we just rewrite it to a codegen sum aggregate function?



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155186821
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155220740
  
    **[Test build #45429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45429/consoleFull)** for PR 9555 at commit [`c88e6c0`](https://github.com/apache/spark/commit/c88e6c0faf9344ffde9e0cc954f33e570866fe83).



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154938000
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155249587
  
    **[Test build #45429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45429/consoleFull)** for PR 9555 at commit [`c88e6c0`](https://github.com/apache/spark/commit/c88e6c0faf9344ffde9e0cc954f33e570866fe83).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class MasterWebUI(`
      * `public class JavaAFTSurvivalRegressionExample `
      * `class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)`
      * `case class TypedAggregateExpression(`
      * `abstract class Aggregator[-A, B, C] `



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155187488
  
    **[Test build #45409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45409/consoleFull)** for PR 9555 at commit [`0a0a199`](https://github.com/apache/spark/commit/0a0a19944ab74fc9c937c4d21c91c4dece80bc1b).



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154909028
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44238638
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -39,10 +39,10 @@ private[sql] object Column {
     }
     
     /**
    - * A [[Column]] where an [[Encoder]] has been given for the expected return type.
    + * A [[Column]] where an [[Encoder]] has been given for the expected input and return type.
      * @since 1.6.0
      */
    -class TypedColumn[T](expr: Expression)(implicit val encoder: Encoder[T]) extends Column(expr)
    +class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)
    --- End diff --
    
    Do we need a variance annotation here? Not much else uses it in the codebase. Also, it's not clear what the types T and U are (what is the "input type" of a column?).



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155218844
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44238871
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala ---
    @@ -0,0 +1,31 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.expressions.Aggregator
    +
    +/** An `Aggregator` that adds up any numeric type returned by the given function. */
    +class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
    --- End diff --
    
    Actually, achieving this might also require the Aggregator interface to be specialized somehow. Not sure whether that's worth doing in a public API, but it would be nice to have a sum() that is fast out of the box, maybe by implementing it in some other way.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9555



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44331060
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -24,12 +26,34 @@ import scala.util.Try
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
     import org.apache.spark.sql.catalyst.analysis.{UnresolvedFunction, Star}
    +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, Encoder}
     import org.apache.spark.sql.catalyst.expressions._
     import org.apache.spark.sql.catalyst.plans.logical.BroadcastHint
    +import org.apache.spark.sql.execution.aggregate.SumOf
     import org.apache.spark.sql.types._
     import org.apache.spark.util.Utils
     
     /**
    + * Ensures that java functions signatures for methods that now return a [[TypedColumn]] still have
    + * legacy equivalents in bytecode.  This compatibility is done by forcing the compiler to generate
    + * "bridge" methods due to the use of covariant return types.
    + *
    + * {{{
    + * In LegacyFunctions:
    + * public abstract org.apache.spark.sql.Column avg(java.lang.String);
    + *
    + * In functions:
    + * public static org.apache.spark.sql.TypedColumn<java.lang.Object, java.lang.Object> avg(...);
    + * }}}
    + *
    + * This allows us to use the same functions both in typed [[Dataset]] operations and untyped
    + * [[DataFrame]] operations when the return type for a given function is statically known.
    + */
    +abstract class LegacyFunctions {
    --- End diff --
    
    private[sql]
    
    Can you also add an inline comment saying we should remove this in Spark 2?



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44238765
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala ---
    @@ -0,0 +1,31 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.expressions.Aggregator
    +
    +/** An `Aggregator` that adds up any numeric type returned by the given function. */
    +class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
    --- End diff --
    
    This particular implementation of Sum will be pretty slow because of the Numeric -- any way we can add specialized ones for each numeric type? I imagine sum will be a pretty popular aggregation.
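    
    To illustrate the concern, here is a sketch in plain Scala contrasting a `Numeric`-based sum with hand-specialized ones. `genericSum`, `sumLong` and `sumDouble` are hypothetical helpers written for this comparison, not code from the PR. In the generic version every addition goes through the `Numeric` type class, whose methods take erased type parameters, so primitive values are typically boxed on each call:
    
    ```scala
    object SpecializedSumSketch {
      // Generic: each step calls num.plus through the type class on boxed values.
      def genericSum[N](xs: Seq[N])(implicit num: Numeric[N]): N =
        xs.foldLeft(num.zero)((l, r) => num.plus(l, r))
    
      // Hand-specialized for Long: a primitive accumulator and a plain loop.
      def sumLong(xs: Array[Long]): Long = {
        var acc = 0L
        var i = 0
        while (i < xs.length) { acc += xs(i); i += 1 }
        acc
      }
    
      // Hand-specialized for Double, same structure.
      def sumDouble(xs: Array[Double]): Double = {
        var acc = 0.0
        var i = 0
        while (i < xs.length) { acc += xs(i); i += 1 }
        acc
      }
    }
    ```
    
    Both styles return the same totals; the difference is only in how much indirection and allocation each addition may incur.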



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155212888
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155249715
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44238545
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.expressions
    +
    +import org.apache.spark.sql.catalyst.encoders.{encoderFor, Encoder}
    +import org.apache.spark.sql.catalyst.expressions.aggregate.{Complete, AggregateExpression2}
    +import org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
    +import org.apache.spark.sql.{Dataset, DataFrame, TypedColumn}
    +
    +/**
    + * A base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]]
    + * operations to take all of the elements of a group and reduce them to a single value.
    + *
    + * For example, the following aggregator extracts an `int` from a specific class and adds them up:
    + * {{{
    + *   case class Data(i: Int)
    + *
    + *   val customSummer =  new Aggregator[Data, Int, Int] {
    + *     def prepare(d: Data) = d.i
    + *     def reduce(l: Int, r: Int) = l + r
    + *     def present(r: Int) = r
    + *   }.toColumn()
    + *
    + *   val ds: Dataset[Data]
    + *   val aggregated = ds.select(customSummer)
    + * }}}
    + *
    + * Based loosely on Aggregator from Algebird: https://github.com/twitter/algebird
    + *
    + * @tparam A The input type for the aggregation.
    + * @tparam B The type of the intermediate value of the reduction.
    + * @tparam C The type of the final result.
    + */
    +abstract class Aggregator[-A, B, C] {
    --- End diff --
    
    This particular interface for Aggregator isn't that efficient, because every A needs to be converted to a B with prepare() before being merged into another B. It would be better to have methods from A -> B, (B, A) -> B and (B, B) -> B, similar to some of the operations in RDD. Also, the "reduce" methods should be allowed to modify the left-hand "B" in-place.
    
    Another option is to add a "zero" method and then (B, A) -> B. What happens if you aggregate an empty list, are you just going to get null?
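    
    One possible shape for the "zero" variant suggested above, sketched in plain Scala. `FoldAggregator`, `update`, `merge` and `run` are hypothetical names, not the interface in this PR, and the sketch uses immutable values rather than the in-place mutation mentioned above:
    
    ```scala
    object FoldStyleAggregatorSketch {
      // Hypothetical interface: inputs are folded straight into a buffer,
      // so no per-element A -> B conversion is needed before merging.
      abstract class FoldAggregator[-A, B, C] {
        def zero: B                         // result buffer for an empty group
        def update(buffer: B, input: A): B  // (B, A) -> B
        def merge(l: B, r: B): B            // (B, B) -> B, combines partial results
        def present(reduction: B): C
      }
    
      // Each partition folds from zero; the partial buffers are then merged.
      def run[A, B, C](agg: FoldAggregator[A, B, C], partitions: Seq[Seq[A]]): C = {
        val partials = partitions.map(p => p.foldLeft(agg.zero)((b, a) => agg.update(b, a)))
        agg.present(partials.foldLeft(agg.zero)((l, r) => agg.merge(l, r)))
      }
    
      val sumInts: FoldAggregator[Int, Long, Long] =
        new FoldAggregator[Int, Long, Long] {
          def zero: Long = 0L
          def update(buffer: Long, input: Int): Long = buffer + input
          def merge(l: Long, r: Long): Long = l + r
          def present(reduction: Long): Long = reduction
        }
    
      val nonEmpty: Long = run(sumInts, Seq(Seq(1, 2), Seq(3)))
      val empty: Long = run(sumInts, Seq.empty[Seq[Int]])
    }
    ```
    
    With a `zero`, an empty input has a well-defined answer in this sketch (`empty` is 0 and `nonEmpty` is 6) instead of needing null.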



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155188203
  
    **[Test build #45409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45409/consoleFull)** for PR 9555 at commit [`0a0a199`](https://github.com/apache/spark/commit/0a0a19944ab74fc9c937c4d21c91c4dece80bc1b).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)`
      * `class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable `
      * `case class TypedAggregateExpression(`
      * `abstract class Aggregator[-A, B, C] `
      * `abstract class LegacyFunctions `



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155213421
  
    **[Test build #45424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45424/consoleFull)** for PR 9555 at commit [`f559f5a`](https://github.com/apache/spark/commit/f559f5af7a4cc65b575f8fb71c53c79bfc0cf7b1).



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155212954
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44322647
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -39,10 +39,10 @@ private[sql] object Column {
     }
     
     /**
    - * A [[Column]] where an [[Encoder]] has been given for the expected return type.
    + * A [[Column]] where an [[Encoder]] has been given for the expected input and return type.
      * @since 1.6.0
      */
    -class TypedColumn[T](expr: Expression)(implicit val encoder: Encoder[T]) extends Column(expr)
    +class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)
    --- End diff --
    
    I think this is required if we want the ability to have type safety and good inference for some expressions, while still allowing you to mix in the SQL expressions that are type-checked by the analyzer.
    
    If you get rid of `T` then I don't think you'd need extra annotations for something like `sum(_._2)` and it wouldn't be type checked.  If you get rid of the annotation then you won't be able to use `expr("sum(_2)")` without casting.
    
    I'll add some docs to clarify.
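    
    A small plain-Scala sketch of the variance point, under simplifying assumptions: `Col`, `sumOf`, `exprCol` and `agg` are hypothetical stand-ins, not the real `TypedColumn`, `sum`, `expr` or `agg`. A column built from a SQL string accepts any input type, and it is only usable next to a column tied to the tuple type because the input parameter is contravariant:
    
    ```scala
    object VarianceSketch {
      // Hypothetical stand-in for TypedColumn[-T, U]: T is the input type, U the result type.
      class Col[-T, U](val name: String)
    
      // A lambda-based column is tied to its input type, so the lambda is checked by scalac.
      def sumOf[T](f: T => Int): Col[T, Int] = new Col[T, Int]("sumOf")
    
      // A SQL-string column accepts any input; nothing about it is checked at compile time.
      def exprCol(sql: String): Col[Any, Int] = new Col[Any, Int](sql)
    
      // Hypothetical agg: both columns must accept the element type of the data.
      def agg[T](data: Seq[T])(c1: Col[T, Int], c2: Col[T, Int]): Seq[String] =
        Seq(c1.name, c2.name)
    
      val data: Seq[(String, Int)] = Seq(("a", 10), ("b", 1))
    
      // Col[Any, Int] conforms to Col[(String, Int), Int] only because T is contravariant;
      // with an invariant T the second argument would need a cast.
      val names: Seq[String] =
        agg(data)(sumOf((t: (String, Int)) => t._2), exprCol("sum(_2)"))
    }
    ```
    
    The lambda parameter is annotated explicitly here to keep the sketch independent of how type inference resolves `T`.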



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155186853
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154907616
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155240029
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155238990
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44323023
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala ---
    @@ -0,0 +1,31 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.expressions.Aggregator
    +
    +/** An `Aggregator` that adds up any numeric type returned by the given function. */
    +class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
    --- End diff --
    
    Yeah, this was mostly an example I wrote to make sure you could do flexible things.  I agree with Reynold that we can optimize common cases under the covers.
    
    Regarding specialization, the easiest way for us to do this is probably to add encoders that reuse objects under the covers.  That's probably the nicest way to avoid boxing without getting into the mess that is Scala specialization.
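
To make the boxing concern concrete, here is a rough sketch of the general object-reuse idea. It is an assumption-laden illustration, not how Spark's encoders work: `GenericSum`, `LongBuffer`, and `ReuseSketch` are hypothetical names invented for this example.

```scala
// Hypothetical illustration only; this is not Spark's encoder machinery.

// Generic version: N is erased at runtime, so intermediate sums of a primitive
// type such as Long may be boxed on every call to numeric.plus.
class GenericSum[N](implicit numeric: Numeric[N]) {
  def sumAll(values: Iterator[N]): N =
    values.foldLeft(numeric.zero)((acc, v) => numeric.plus(acc, v))
}

// A mutable holder with a primitive field, allocated once and reused for every row.
final class LongBuffer(var value: Long)

object ReuseSketch {
  def sumInto(buffer: LongBuffer, values: Array[Long]): LongBuffer = {
    buffer.value = 0L
    var i = 0
    while (i < values.length) {
      buffer.value += values(i) // primitive addition, no per-row allocation
      i += 1
    }
    buffer
  }

  val generic: Long = (new GenericSum[Long]).sumAll(Iterator(10L, 20L, 30L))
  val reused: Long = sumInto(new LongBuffer(0L), Array(10L, 20L, 30L)).value
}
```

Both paths compute the same total; the second keeps the running value in a primitive field of one reused object instead of producing a new boxed value per step.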



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154908108
  
    **[Test build #45333 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45333/consoleFull)** for PR 9555 at commit [`9beedee`](https://github.com/apache/spark/commit/9beedee960887255a4a88e901b27ef5e7c438f83).



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154907986
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154907951
  
    /cc @rxin @mateiz 



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44337520
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -24,12 +26,34 @@ import scala.util.Try
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
     import org.apache.spark.sql.catalyst.analysis.{UnresolvedFunction, Star}
    +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, Encoder}
     import org.apache.spark.sql.catalyst.expressions._
     import org.apache.spark.sql.catalyst.plans.logical.BroadcastHint
    +import org.apache.spark.sql.execution.aggregate.SumOf
     import org.apache.spark.sql.types._
     import org.apache.spark.util.Utils
     
     /**
    + * Ensures that java functions signatures for methods that now return a [[TypedColumn]] still have
    + * legacy equivalents in bytecode.  This compatibility is done by forcing the compiler to generate
    + * "bridge" methods due to the use of covariant return types.
    + *
    + * {{{
    + * In LegacyFunctions:
    + * public abstract org.apache.spark.sql.Column avg(java.lang.String);
    + *
    + * In functions:
    + * public static org.apache.spark.sql.TypedColumn<java.lang.Object, java.lang.Object> avg(...);
    + * }}}
    + *
    + * This allows us to use the same functions both in typed [[Dataset]] operations and untyped
    + * [[DataFrame]] operations when the return type for a given function is statically known.
    + */
    +abstract class LegacyFunctions {
    --- End diff --
    
    @mateiz, did you see this part too?  This will let users call things like `count("*")` without having to manually say `count("*").as[Long]`, which seems pretty nice.  We can also do it without breaking binary compatibility.
    
    Questions:
     - Should we do this at all? (My vote is yes, at least for very common things like `count`.)
     - Should we do it for all things that return static types?
     - Should we do it even for things that don't return static types? (E.g. `avg` returns `Double` or `Decimal`; we could default to `Double` and require the user to say `.as[BigDecimal]` if they want to maintain precision.)
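
The bridge-method trick described in the diff's doc comment can be sketched with stand-in classes. Everything inside `BridgeSketch` below is hypothetical and only mirrors the shape of the diff (an abstract class with the old return type, and an object that overrides it with a narrower one); it does not reproduce Spark's `functions` object.

```scala
object BridgeSketch {
  // Hypothetical stand-ins for illustration only; these are not Spark's classes.
  class Column(val expr: String)
  class TypedColumn[-T, U](expr: String) extends Column(expr)

  // The legacy signature: code compiled against it expects a method returning Column.
  abstract class LegacyFunctions {
    def count(columnName: String): Column
  }

  // The override narrows the return type. On the JVM the compiler also emits a
  // synthetic bridge method that keeps the old Column-returning signature.
  object functions extends LegacyFunctions {
    override def count(columnName: String): TypedColumn[Any, Long] =
      new TypedColumn[Any, Long](s"count($columnName)")
  }

  // New callers see the typed result without writing `.as[Long]`.
  val typed: TypedColumn[Any, Long] = functions.count("*")

  // Callers going through the legacy type still get a plain Column.
  val legacy: Column = (functions: LegacyFunctions).count("*")

  // Reflection lists one `count` per return type: the real method and its bridge.
  val countReturnTypes: Set[Class[_]] =
    functions.getClass.getMethods
      .filter(_.getName == "count")
      .map(_.getReturnType)
      .toSet
}
```

The reflective check at the end is only there to show that both signatures exist in bytecode, which is what keeps previously compiled callers linking.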



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155215081
  
    **[Test build #45425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45425/consoleFull)** for PR 9555 at commit [`0a5a161`](https://github.com/apache/spark/commit/0a5a161c063c07f500984dfb3511fe0580f91602).



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44322762
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.expressions
    +
    +import org.apache.spark.sql.catalyst.encoders.{encoderFor, Encoder}
    +import org.apache.spark.sql.catalyst.expressions.aggregate.{Complete, AggregateExpression2}
    +import org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
    +import org.apache.spark.sql.{Dataset, DataFrame, TypedColumn}
    +
    +/**
    + * A base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]]
    + * operations to take all of the elements of a group and reduce them to a single value.
    + *
    + * For example, the following aggregator extracts an `int` from a specific class and adds them up:
    + * {{{
    + *   case class Data(i: Int)
    + *
    + *   val customSummer =  new Aggregator[Data, Int, Int] {
    + *     def prepare(d: Data) = d.i
    + *     def reduce(l: Int, r: Int) = l + r
    + *     def present(r: Int) = r
    + *   }.toColumn()
    + *
    + *   val ds: Dataset[Data]
    + *   val aggregated = ds.select(customSummer)
    + * }}}
    + *
    + * Based loosely on Aggregator from Algebird: https://github.com/twitter/algebird
    + *
    + * @tparam A The input type for the aggregation.
    + * @tparam B The type of the intermediate value of the reduction.
    + * @tparam C The type of the final result.
    + */
    +abstract class Aggregator[-A, B, C] {
    --- End diff --
    
    I like the idea of adding a `zero`; it fits better with the way we link this into the existing aggregation code.  I'll update.
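
As a rough sketch of what adding a `zero` could look like, the following extends the `prepare`/`reduce`/`present` shape from the diff with an initial value. `SketchAggregator` and its `aggregate` helper are hypothetical and run on a local collection; the signatures the PR finally adopts may differ.

```scala
object ZeroSketch {
  // Hypothetical variant of the interface under review, with an explicit `zero`.
  abstract class SketchAggregator[-A, B, C] {
    def zero: B                      // initial intermediate value; also the result for empty groups
    def prepare(input: A): B
    def reduce(l: B, r: B): B
    def present(reduction: B): C

    // Local stand-in for what an execution engine would do for one group.
    def aggregate(group: Iterable[A]): C =
      present(group.foldLeft(zero)((acc, a) => reduce(acc, prepare(a))))
  }

  case class Data(i: Int)

  val customSummer = new SketchAggregator[Data, Int, Int] {
    def zero = 0
    def prepare(d: Data) = d.i
    def reduce(l: Int, r: Int) = l + r
    def present(r: Int) = r
  }

  val total: Int = customSummer.aggregate(Seq(Data(1), Data(2), Data(3)))
  val empty: Int = customSummer.aggregate(Seq.empty[Data])
}
```

With a `zero`, an empty group has a well-defined result and the fold needs no special case for the first element, which is one plausible reading of why it fits existing aggregation buffers better.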



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155238860
  
    **[Test build #45424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45424/consoleFull)** for PR 9555 at commit [`f559f5a`](https://github.com/apache/spark/commit/f559f5af7a4cc65b575f8fb71c53c79bfc0cf7b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class MasterWebUI(`
       * `public class JavaAFTSurvivalRegressionExample `
       * `class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)`
       * `case class TypedAggregateExpression(`
       * `abstract class Aggregator[-A, B, C] `



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155218811
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154907604
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155188209
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154909428
  
    The user-facing API looks good to me! I added some comments on the internal interfaces though.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155214267
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44340946
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala ---
    @@ -0,0 +1,19 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    --- End diff --
    
    empty file?



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155239894
  
    **[Test build #45425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45425/consoleFull)** for PR 9555 at commit [`0a5a161`](https://github.com/apache/spark/commit/0a5a161c063c07f500984dfb3511fe0580f91602).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class MasterWebUI(`
       * `class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)`
       * `case class TypedAggregateExpression(`
       * `abstract class Aggregator[-A, B, C] `



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154937929
  
    **[Test build #45333 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45333/consoleFull)** for PR 9555 at commit [`9beedee`](https://github.com/apache/spark/commit/9beedee960887255a4a88e901b27ef5e7c438f83).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class TypedColumn[-T, U](expr: Expression, val encoder: Encoder[U]) extends Column(expr)`
       * `class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable `
       * `case class TypedAggregateExpression(`
       * `abstract class Aggregator[-A, B, C] `
       * `abstract class LegacyFunctions `



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155214248
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-154907981
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/9555#issuecomment-155239475
  
    LGTM



[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9555#discussion_r44326231
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.expressions
    +
    +import org.apache.spark.sql.catalyst.encoders.{encoderFor, Encoder}
    +import org.apache.spark.sql.catalyst.expressions.aggregate.{Complete, AggregateExpression2}
    +import org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
    +import org.apache.spark.sql.{Dataset, DataFrame, TypedColumn}
    +
    +/**
    + * A base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]]
    + * operations to take all of the elements of a group and reduce them to a single value.
    + *
    + * For example, the following aggregator extracts an `int` from a specific class and adds them up:
    + * {{{
    + *   case class Data(i: Int)
    + *
    + *   val customSummer =  new Aggregator[Data, Int, Int] {
    + *     def prepare(d: Data) = d.i
    + *     def reduce(l: Int, r: Int) = l + r
    + *     def present(r: Int) = r
    + *   }.toColumn()
    + *
    + *   val ds: Dataset[Data]
    + *   val aggregated = ds.select(customSummer)
    + * }}}
    + *
    + * Based loosely on Aggregator from Algebird: https://github.com/twitter/algebird
    + *
    + * @tparam A The input type for the aggregation.
    + * @tparam B The type of the intermediate value of the reduction.
    + * @tparam C The type of the final result.
    + */
    +abstract class Aggregator[-A, B, C] {
    --- End diff --
    
    +1

