Posted to reviews@spark.apache.org by icexelloss <gi...@git.apache.org> on 2017/12/04 06:38:36 UTC

[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

GitHub user icexelloss opened a pull request:

    https://github.com/apache/spark/pull/19872

    WIP: [SPARK-22274][PySpark] User-defined aggregation functions with pandas udf

    ## What changes were proposed in this pull request?
    
    Add support for pandas_udf in groupby().agg()
    
    ## How was this patch tested?
    
    GroupbyAggTests
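
    A minimal usage sketch of the proposed API (a sketch only, assuming an
    active SparkSession bound to `spark`; it mirrors the docstring example
    quoted later in this thread):

        from pyspark.sql.functions import pandas_udf, PandasUDFType

        df = spark.createDataFrame(
            [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
            ("id", "v"))

        @pandas_udf("double", PandasUDFType.GROUP_AGG)
        def mean_udf(v):
            # v is a pandas.Series holding all values of "v" for one group
            return v.mean()

        df.groupby("id").agg(mean_udf(df["v"])).show()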
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/icexelloss/spark SPARK-22274-groupby-agg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19872.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19872
    
----
commit f71575782be3f9c41184eeafa275b5ba1cb5fb83
Author: Li Jin <ic...@lis-macbook-pro.local>
Date:   2017-12-01T17:26:26Z

    Initial commit: wip

commit 2e03eec8de2ed6d38e807428c18f2500a8717b32
Author: Li Jin <ic...@lis-macbook-pro.local>
Date:   2017-12-01T22:54:02Z

    Test working. Need clean up

commit 456c4a8adf646ee46b00f8ce51d4e9e8279abc3e
Author: Li Jin <ic...@lis-macbook-pro.local>
Date:   2017-12-04T06:34:16Z

    Add tests

commit 35ff548ac942d210ccd99fb2a60b95e2d4a28e2a
Author: Li Jin <ic...@lis-macbook-pro.local>
Date:   2017-12-04T06:36:03Z

    Clean up

----


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159108762
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -215,3 +228,49 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
         }
       }
     }
    +
    +
    +/**
    + * Extract all the group aggregate Pandas UDFs in logical aggregation, evaluate the UDFs first
    + * and then the expressions that depend on the result of the UDFs.
    + */
    +object ExtractGroupAggPandasUDFFromAggregate extends Rule[LogicalPlan] {
    --- End diff --
    
    I ended up removing the rule. Now I put the unsafe projection inside `AggregateInPandasExec`, similar to `HashAggregateExec`.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161658799
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: one or more `pandas.Series` -> a scalar.
    +       The returnType should be a primitive data type, e.g., `DoubleType()`.
    +       The returned scalar can be either a Python primitive type, e.g., `int` or `float`,
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    --- End diff --
    
    Sorry @cloud-fan, I don't understand this comment; could you elaborate?


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158872825
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,29 +48,46 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF) => true
    +      case Alias(child, _) => isPandasGroupAggUdf(child)
    +      case _ => false
    +  }
    +
    +  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
    +    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)
    +    actualAggExpr.exists(isPandasGroupAggUdf)
    +  }
    +
    +
       private def extract(agg: Aggregate): LogicalPlan = {
         val projList = new ArrayBuffer[NamedExpression]()
         val aggExpr = new ArrayBuffer[NamedExpression]()
    -    agg.aggregateExpressions.foreach { expr =>
    -      if (hasPythonUdfOverAggregate(expr, agg)) {
    -        // Python UDF can only be evaluated after aggregate
    -        val newE = expr transformDown {
    -          case e: Expression if belongAggregate(e, agg) =>
    -            val alias = e match {
    -              case a: NamedExpression => a
    -              case o => Alias(e, "agg")()
    -            }
    -            aggExpr += alias
    -            alias.toAttribute
    +
    +    if (hasPandasGroupAggUdf(agg)) {
    +      Aggregate(agg.groupingExpressions, agg.aggregateExpressions, agg.child)
    --- End diff --
    
    I am not sure, but I added a copy in `ExtractGroupAggPandasUDFFromAggregate`, similar to the existing rules.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162097764
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -360,9 +369,23 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
                   resultExpressions,
                   planLater(child))
               }
    -
    --- End diff --
    
    Reverted


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84415/testReport)** for PR 19872 at commit [`a1058b8`](https://github.com/apache/spark/commit/a1058b8f91bc1093ef231bf41d6553d045788abc).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157891391
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala ---
    @@ -38,3 +38,13 @@ case class FlatMapGroupsInPandas(
        */
       override val producedAttributes = AttributeSet(output)
     }
    +
    +case class AggregateInPandas(
    +    groupingAttributes: Seq[Attribute],
    +    functionExprs: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: LogicalPlan
    +) extends UnaryNode {
    --- End diff --
    
    Removed


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162234954
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: one or more `pandas.Series` -> a scalar.
    +       The returnType should be a primitive data type, e.g., `DoubleType()`.
    +       The returned scalar can be either a Python primitive type, e.g., `int` or `float`,
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    +       ...     return v.mean()
    +       >>> df.groupby("id").agg(mean_udf(df['v'])).show()  # doctest: +SKIP
    +       +---+-----------+
    +       | id|mean_udf(v)|
    +       +---+-----------+
    +       |  1|        1.5|
    +       |  2|        6.0|
    +       +---+-----------+
    +
    +       .. note:: There is no partial aggregation with group aggregate UDFs, i.e.,
    +           a full shuffle is required.
    --- End diff --
    
    One more note: we will load all the data of a group into memory, so users should be aware of the OOM risk if the data is skewed and there is a very large group.
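
    A quick sketch of why (hedged; `group_size` here is just an illustrative
    UDF, not part of the PR): each invocation of a group aggregate UDF
    receives the entire group as one pandas.Series, so a single oversized
    group must fit in one Python worker's memory.

        from pyspark.sql.functions import pandas_udf, PandasUDFType

        @pandas_udf('long', PandasUDFType.GROUP_AGG)
        def group_size(v):
            # the whole group's values arrive in this single call
            return len(v)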


---



[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160778766
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -271,9 +272,14 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
           case PhysicalAggregation(
             namedGroupingExpressions, aggregateExpressions, rewrittenResultExpressions, child) =>
     
    +        require(
    +          !aggregateExpressions.exists(PythonUDF.isGroupAggPandasUDF),
    +          "Streaming aggregation doesn't support group aggregate pandas UDF"
    +        )
    --- End diff --
    
    Done


---



[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85604 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85604/testReport)** for PR 19872 at commit [`46b111c`](https://github.com/apache/spark/commit/46b111c78105a7a0cc8fae47d064514ae496ca3b).


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r156031375
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4016,6 +4016,124 @@ def test_unsupported_types(self):
                 with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
                     df.groupby('id').apply(f).collect()
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +    def assertFramesEqual(self, expected, result):
    --- End diff --
    
    nit: how about making this a common method?
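
    For reference, a sketch of what such a shared helper might look like
    (hypothetical placement on the test base class; later revisions in this
    thread call it as `self.assertPandasEqual`):

        def assertPandasEqual(self, expected, result):
            # compare two pandas DataFrames, showing both on mismatch
            msg = ("DataFrames are not equal:\n"
                   "Expected:\n%s\n%s\n\n"
                   "Result:\n%s\n%s" % (expected, expected.dtypes,
                                        result, result.dtypes))
            self.assertTrue(expected.equals(result), msg=msg)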


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160620400
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -511,7 +517,6 @@ def test_udf_with_order_by_and_limit(self):
             my_copy = udf(lambda x: x, IntegerType())
             df = self.spark.range(10).orderBy("id")
             res = df.select(df.id, my_copy(df.id).alias("copy")).limit(1)
    -        res.explain(True)
    --- End diff --
    
    Can we remove this?
    I guess it only prints out the plan, but I'd revert it just in case, since it's not related to this PR.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162235851
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4273,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    --- End diff --
    
    Can we manually calculate the expected result and check it in the test?
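
    One way to do that, sketched with hand-computed numbers (the values
    below are for the small docstring dataset quoted earlier in this
    thread, not for the generated test data used here):

        import pandas as pd
        from pyspark.sql.functions import lit

        # df holds [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]
        expected = pd.DataFrame({'id': [1, 2],
                                 'weighted_mean(v, 1.0)': [1.5, 6.0]})
        result = (df.groupby('id')
                  .agg(weighted_mean_udf(df.v, lit(1.0)))
                  .sort('id')
                  .toPandas())
        self.assertPandasEqual(expected, result)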


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154642230
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,135 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.StructType
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingAttributes: Seq[Attribute],
    +    func: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +  private val udfs = func.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingAttributes.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingAttributes) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingAttributes.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    // val argOffsets = Array((0 until (child.output.length - groupingAttributes.length)).toArray)
    +    val schema = StructType(child.schema.drop(groupingAttributes.length))
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +          allInputs += e
    --- End diff --
    
    indentation nit


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85137/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161491996
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: one or more `pandas.Series` -> a scalar.
    +       The returnType should be a primitive data type, e.g., `DoubleType()`.
    +       The returned scalar can be either a Python primitive type, e.g., `int` or `float`,
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    --- End diff --
    
    shall we include grouping columns?


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86189/testReport)** for PR 19872 at commit [`9824bbd`](https://github.com/apache/spark/commit/9824bbd6ca7c85cd493e5e7eef0db15bbaf1ad95).


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161507315
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -288,9 +289,13 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
           case PhysicalAggregation(
             namedGroupingExpressions, aggregateExpressions, rewrittenResultExpressions, child) =>
     
    +        require(
    --- End diff --
    
    Should we throw an `AnalysisException` instead?


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84446/testReport)** for PR 19872 at commit [`c1dc543`](https://github.com/apache/spark/commit/c1dc543f9ff6e77b310d1a38c3f3c2c4e8eeaf63).


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85446/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160616235
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,12 +15,30 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
     
    -import org.apache.spark.api.python.PythonFunction
    -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression}
    +import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
    +import org.apache.spark.sql.catalyst.util.toPrettySQL
     import org.apache.spark.sql.types.DataType
     
    +/**
    + * Helper functions for PythonUDF
    + */
    +object PythonUDF {
    +  def isScalarPythonUDF(e: Expression): Boolean = {
    +    e.isInstanceOf[PythonUDF] &&
    +      Set(
    --- End diff --
    
    Let's make this set a `private[this] val ...` in the `PythonUDF` object to avoid creating the set every time this method is called.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161855872
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result5 = (df.groupby(plus_one(df.id))
    +                   .agg(plus_one(sum_udf(plus_one(df.v))))
    +                   .sort('plus_one(id)'))
    +        expected5 = (df.groupby(plus_one(df.id))
    +                     .agg(plus_one(sum(plus_one(df.v))).alias('plus_one(sum_udf(plus_one(v)))'))
    +                     .sort('plus_one(id)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +
    +    def test_multiple(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +        sum_udf = self.sum_udf
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = (df.groupBy('id')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.v),
    +                        weighted_mean_udf(df.v, df.w))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.groupBy('id')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.v).alias('sum_udf(v)'),
    +                          mean(df.v).alias('weighted_mean_udf(v, w)'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.groupBy('id', 'v')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.id))
    +                   .sort('id', 'v')
    +                   .toPandas())
    +
    +        expected2 = (df.groupBy('id', 'v')
    +                     .agg(mean_udf(df.v).alias('mean_udf(v)'),
    +                          sum_udf(df.id).alias('sum_udf(id)'))
    +                     .sort('id', 'v')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +
    +    def test_complex_grouping(self):
    +        from pyspark.sql.functions import lit, sum
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +
    +        result1 = df.groupby(df.id + 1).agg(sum_udf(df.v))
    +        expected1 = df.groupby(df.id + 1).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result2 = df.groupby().agg(sum_udf(df.v))
    +        expected2 = df.groupby().agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
    +        expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result4 = df.groupby(plus_one(df.id)).agg(sum_udf(df.v))
    +        expected4 = df.groupby(plus_one(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result5 = df.groupby(plus_two(df.id)).agg(sum_udf(df.v))
    +        expected5 = df.groupby(plus_two(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result6 = df.groupby(df.id, plus_one(df.id)).agg(sum_udf(df.v))
    +        expected6 = df.groupby(df.id, plus_one(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +        self.assertPandasEqual(expected6.toPandas(), result6.toPandas())
    +
    +    def test_complex_expression(self):
    --- End diff --
    
    I am leaning towards keeping the existing name. This doesn't just test mixed UDFs, but also mixed UDF and SQL expressions. I added some comments in the test to make this clearer. What do you think?


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160779041
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,12 +15,30 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
     
    -import org.apache.spark.api.python.PythonFunction
    -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression}
    +import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
    +import org.apache.spark.sql.catalyst.util.toPrettySQL
     import org.apache.spark.sql.types.DataType
     
    +/**
    + * Helper functions for PythonUDF
    + */
    +object PythonUDF {
    +  def isScalarPythonUDF(e: Expression): Boolean = {
    +    e.isInstanceOf[PythonUDF] &&
    +      Set(
    --- End diff --
    
    Fixed.


---



[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    @ueshin I have addressed your existing comments. In terms of functionality and testing, I think this is pretty much ready.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84631/
    Test FAILed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162047382
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -333,16 +339,19 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
        */
       object Aggregation extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    -      case PhysicalAggregation(
    -          groupingExpressions, aggregateExpressions, resultExpressions, child) =>
    +      case PhysicalAggregation(groupingExpressions, aggExpressions, resultExpressions, child)
    +        if aggExpressions.forall(expr => expr.isInstanceOf[AggregateExpression]) =>
    +
    +        val aggregateExpressions = aggExpressions.map(expr =>
    +          expr.asInstanceOf[AggregateExpression])
     
             val (functionsWithDistinct, functionsWithoutDistinct) =
               aggregateExpressions.partition(_.isDistinct)
             if (functionsWithDistinct.map(_.aggregateFunction.children).distinct.length > 1) {
               // This is a sanity check. We should not reach here when we have multiple distinct
               // column sets. Our MultipleDistinctRewriter should take care this case.
               sys.error("You hit a query analyzer bug. Please report your query to " +
    -              "Spark user mailing list.")
    +            "Spark user mailing list.")
    --- End diff --
    
    I can't believe I am nitpicking this again, but let's maybe revert this change ...


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84414/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161501889
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result5 = (df.groupby(plus_one(df.id))
    +                   .agg(plus_one(sum_udf(plus_one(df.v))))
    +                   .sort('plus_one(id)'))
    +        expected5 = (df.groupby(plus_one(df.id))
    +                     .agg(plus_one(sum(plus_one(df.v))).alias('plus_one(sum_udf(plus_one(v)))'))
    +                     .sort('plus_one(id)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +
    +    def test_multiple(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +        sum_udf = self.sum_udf
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = (df.groupBy('id')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.v),
    +                        weighted_mean_udf(df.v, df.w))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.groupBy('id')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.v).alias('sum_udf(v)'),
    +                          mean(df.v).alias('weighted_mean_udf(v, w)'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.groupBy('id', 'v')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.id))
    +                   .sort('id', 'v')
    +                   .toPandas())
    +
    +        expected2 = (df.groupBy('id', 'v')
    +                     .agg(mean_udf(df.v).alias('mean_udf(v)'),
    +                          sum_udf(df.id).alias('sum_udf(id)'))
    +                     .sort('id', 'v')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +
    +    def test_complex_grouping(self):
    +        from pyspark.sql.functions import lit, sum
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +
    +        result1 = df.groupby(df.id + 1).agg(sum_udf(df.v))
    +        expected1 = df.groupby(df.id + 1).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result2 = df.groupby().agg(sum_udf(df.v))
    +        expected2 = df.groupby().agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
    +        expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result4 = df.groupby(plus_one(df.id)).agg(sum_udf(df.v))
    +        expected4 = df.groupby(plus_one(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result5 = df.groupby(plus_two(df.id)).agg(sum_udf(df.v))
    +        expected5 = df.groupby(plus_two(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result6 = df.groupby(df.id, plus_one(df.id)).agg(sum_udf(df.v))
    +        expected6 = df.groupby(df.id, plus_one(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +        self.assertPandasEqual(expected6.toPandas(), result6.toPandas())
    +
    +    def test_complex_expression(self):
    --- End diff --
    
    How about `test_complex_mixed_udfs`?


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86487/testReport)** for PR 19872 at commit [`91885e5`](https://github.com/apache/spark/commit/91885e5dbca02daf30f4d7c8dd1560c8b1bbad47).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    retest this please


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85531/testReport)** for PR 19872 at commit [`ca02ad4`](https://github.com/apache/spark/commit/ca02ad41f8d5f0c87a2fdd856d8f41b2d066bc64).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159108765
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -92,8 +99,14 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
      */
     object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
     
    +  private def isPythonUDF(e: Expression): Boolean = {
    +    e.isInstanceOf[PythonUDF] &&
    +    Set(PythonEvalType.SQL_BATCHED_UDF, PythonEvalType.SQL_PANDAS_SCALAR_UDF
    --- End diff --
    
    Fixed


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157895765
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,135 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.StructType
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingAttributes: Seq[Attribute],
    +    func: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +  private val udfs = func.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingAttributes.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingAttributes) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingAttributes.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    // val argOffsets = Array((0 until (child.output.length - groupingAttributes.length)).toArray)
    +    val schema = StructType(child.schema.drop(groupingAttributes.length))
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +          allInputs += e
    --- End diff --
    
    Fixed. Thanks!


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    I don't have any concerns about merging it as is for now. I will double-check tonight for sure. I was wondering if it generally looks good to you.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162237228
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,155 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions._
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Physical node for aggregation with group aggregate Pandas UDF.
    + *
    + * This plan works by sending the necessary (projected) input grouped data as Arrow record batches
    + * to the Python worker; the Python worker invokes the UDF and sends the results back to the
    + * executor; finally, the executor evaluates any post-aggregation expressions and joins the
    + * result with the grouping key.
    + */
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[NamedExpression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override val output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    // Filter child output attributes down to only those that are UDF inputs.
    +    // Also eliminate duplicate UDF inputs.
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        if (allInputs.exists(_.semanticEquals(e))) {
    +          allInputs.indexWhere(_.semanticEquals(e))
    +        } else {
    +          allInputs += e
    +          dataTypes += e.dataType
    +          allInputs.length - 1
    +        }
    +      }.toArray
    +    }.toArray
    +
    +    // Schema of input rows to the python runner
    +    val aggInputSchema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    inputRDD.mapPartitionsInternal { iter =>
    +      val prunedProj = UnsafeProjection.create(allInputs, child.output)
    +
    +      val grouped = if (groupingExpressions.isEmpty) {
    +        // Use an empty unsafe row as a placeholder for the grouping key
    +        Iterator((new UnsafeRow(), iter))
    +      } else {
    +        GroupedIterator(iter, groupingExpressions, child.output)
    +      }.map { case (key, rows) =>
    +        (key, rows.map(prunedProj))
    +      }
    +
    +      val context = TaskContext.get()
    +
    +      // The queue used to buffer input rows so we can drain it to
    +      // combine input with output from Python.
    +      val queue = HybridRowQueue(context.taskMemoryManager(),
    +        new File(Utils.getLocalDir(SparkEnv.get.conf)), groupingExpressions.length)
    +      context.addTaskCompletionListener { _ =>
    +        queue.close()
    +      }
    +
    +      // Add rows to queue to join later with the result.
    +      val projectedRowIter = grouped.map { case (groupingKey, rows) =>
    +        queue.add(groupingKey.asInstanceOf[UnsafeRow])
    +        rows
    +      }
    +
    +      val columnarBatchIter = new ArrowPythonRunner(
    +        pyFuncs, bufferSize, reuseWorker,
    +        PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF, argOffsets, aggInputSchema,
    +        sessionLocalTimeZone, pandasRespectSessionTimeZone)
    +        .compute(projectedRowIter, context.partitionId(), context)
    +
    +      val joinedAttributes =
    +        groupingExpressions.map(_.toAttribute) ++ udfExpressions.map(_.resultAttribute)
    +      val joined = new JoinedRow
    +      val resultProj = UnsafeProjection.create(resultExpressions, joinedAttributes)
    +
    +      columnarBatchIter.map(_.rowIterator.next()).map { aggOutputRow =>
    +        val leftRow = queue.remove()
    --- End diff --
    
    Why do we have to output the grouping columns? E.g. for `select max(a) from t group by b`, we don't need `b` in the result.
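
    For context, a minimal PySpark sketch (hypothetical data; assumes a
    SparkSession `spark` and the GROUP_AGG API from this PR) of what keeping
    the grouping columns enables: the DataFrame `groupby().agg()` API returns
    them alongside the aggregated values, so the plan joins the Python
    results back to the keys before the final projection.

        from pyspark.sql.functions import pandas_udf, PandasUDFType

        @pandas_udf('double', PandasUDFType.GROUP_AGG)
        def agg_max(v):
            # v is a pandas.Series containing one group's values; return a scalar.
            return v.max()

        df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ['b', 'a'])
        # The result keeps the grouping column 'b' next to the aggregated column,
        # matching the DataFrame groupby().agg() semantics; a SQL query such as
        # SELECT max(a) FROM t GROUP BY b would prune 'b' in the result projection.
        df.groupby('b').agg(agg_max(df.a)).show()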


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85446/testReport)** for PR 19872 at commit [`66a31f9`](https://github.com/apache/spark/commit/66a31f9d50dc93e8dc5c2c843101d76951ebf2c8).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159108782
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
    @@ -171,6 +171,7 @@ trait CheckAnalysis extends PredicateHelper {
                             s"appear in the arguments of an aggregate function.")
                       }
                     }
    +              case _: PythonUDF => // OK
    --- End diff --
    
    Fixed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by ramacode2014 <gi...@git.apache.org>.
Github user ramacode2014 commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Please unsubscribe me from this spam
    
    On Wed, Dec 20, 2017 at 10:46 AM, Takuya UESHIN <no...@github.com>
    wrote:
    
    > *@ueshin* commented on this pull request.
    > ------------------------------
    >
    > In sql/core/src/main/scala/org/apache/spark/sql/execution/
    > python/AggregateInPandasExec.scala
    > <https://github.com/apache/spark/pull/19872#discussion_r157931824>:
    >
    > > +      // Add rows to queue to join later with the result.
    > +      val projectedRowIter = grouped.map { case (groupingKey, rows) =>
    > +        queue.add(groupingKey.asInstanceOf[UnsafeRow])
    > +        rows
    > +      }
    > +
    > +      val columnarBatchIter = new ArrowPythonRunner(
    > +        pyFuncs, bufferSize, reuseWorker,
    > +        PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF, argOffsets, schema,
    > +        sessionLocalTimeZone, pandasRespectSessionTimeZone)
    > +        .compute(projectedRowIter, context.partitionId(), context)
    > +
    > +      val joined = new JoinedRow
    > +      val resultProj = UnsafeProjection.create(output, output)
    > +
    > +      columnarBatchIter.map(_.rowIterator.next()).map{ outputRow =>
    >
    > Sorry, I meant columnarBatchIter.flatMap(_.rowIterator.asScala). I'd
    > prefer this one.
    >



---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158951752
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,140 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions._
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[NamedExpression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override val output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        allInputs += e
    --- End diff --
    
    Ah good point. Let me fix that.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158907883
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -477,6 +502,7 @@ def test_udf_with_aggregate_function(self):
             sel = df.groupBy(my_copy(col("key")).alias("k"))\
                 .agg(sum(my_strlen(col("value"))).alias("s"))\
                 .select(my_add(col("k"), col("s")).alias("t"))
    +        self.printPlans(sel)
    --- End diff --
    
    I guess you can use `sel.explain(True)` for this purpose.
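
    For reference, a minimal sketch of that suggestion, using the `sel`
    DataFrame from the test above:

        # explain(True) prints the parsed, analyzed, and optimized logical
        # plans as well as the physical plan, with no custom helper needed.
        sel.explain(True)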


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    @ueshin I pushed some more changes to address your comments. There is one regression in the existing test `SQLTests.test_udf_with_aggregate_function`. I will try to fix it tomorrow.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157823295
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,10 +15,9 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
    --- End diff --
    
    We do. This is similar to https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
    
    The reason is that we need to access the `PythonUDF` class in the analyzer.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    cc @HyukjinKwon @holdenk @ueshin 
    
    It passes some basic tests. I will work on this more next week to clean it up and add more tests.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86346/testReport)** for PR 19872 at commit [`0fec5cf`](https://github.com/apache/spark/commit/0fec5cf86619f0a42647c1c53b4cb5b3d449ecd8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157944622
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,143 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, NamedExpression, PythonUDF, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[Expression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override def output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    allInputs.appendAll(groupingExpressions)
    --- End diff --
    
    I guess we don't need to append `groupingExpressions`. Seems like they are dropped later.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158423362
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,143 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, NamedExpression, PythonUDF, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[Expression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override def output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    allInputs.appendAll(groupingExpressions)
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        allInputs += e
    +        dataTypes += e.dataType
    +        allInputs.length - 1 - groupingExpressions.length
    +      }.toArray
    +    }.toArray
    +
    +    val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    inputRDD.mapPartitionsInternal { iter =>
    +      val grouped = if (groupingExpressions.isEmpty) {
    +        Iterator((null, iter))
    +      } else {
    +        val groupedIter = GroupedIterator(iter, groupingExpressions, child.output)
    +
    +        val dropGrouping =
    +          UnsafeProjection.create(allInputs.drop(groupingExpressions.length), child.output)
    +
    +        groupedIter.map {
    +          case (k, groupedRowIter) => (k, groupedRowIter.map(dropGrouping))
    +        }
    +      }
    +
    +      val context = TaskContext.get()
    +
    +      // The queue used to buffer input rows so we can drain it to
    +      // combine input with output from Python.
    +      val queue = HybridRowQueue(context.taskMemoryManager(),
    +        new File(Utils.getLocalDir(SparkEnv.get.conf)), groupingExpressions.length)
    +      context.addTaskCompletionListener { _ =>
    +        queue.close()
    +      }
    +
    +      // Add rows to queue to join later with the result.
    +      val projectedRowIter = grouped.map { case (groupingKey, rows) =>
    +        queue.add(groupingKey.asInstanceOf[UnsafeRow])
    +        rows
    +      }
    +
    +      val columnarBatchIter = new ArrowPythonRunner(
    +        pyFuncs, bufferSize, reuseWorker,
    +        PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF, argOffsets, schema,
    +        sessionLocalTimeZone, pandasRespectSessionTimeZone)
    +        .compute(projectedRowIter, context.partitionId(), context)
    +
    +      val joined = new JoinedRow
    +      val resultProj = UnsafeProjection.create(output, output)
    --- End diff --
    
    Yes, I think so regarding the behavior. I guess the plan could be different, though.
    We can compare the behavior with non-UDF aggregation and follow that.
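
    A quick sketch of that comparison, along the lines of the tests in this
    PR (assumes the `df` and `mean_udf` fixtures defined there):

        from pyspark.sql.functions import mean

        result = df.groupby('id').agg(mean_udf(df.v)).sort('id').toPandas()
        expected = df.groupby('id') \
            .agg(mean(df.v).alias('mean_udf(v)')).sort('id').toPandas()
        # Both frames should carry the grouping column 'id' plus the
        # aggregated column, with identical values.
        assert expected.equals(result)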


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160779007
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,152 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions._
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Physical node for aggregation with group aggregate Pandas UDF.
    + *
    + * This plan works by sending the necessary (projected) input grouped data as Arrow record batches
    + * to the Python worker; the Python worker invokes the UDF and sends the results back to the
    + * executor; finally, the executor evaluates any post-aggregation expressions and joins the
    + * result with the grouping key.
    + */
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[NamedExpression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override val output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        if (allInputs.exists(_.semanticEquals(e))) {
    +          allInputs.indexWhere(_.semanticEquals(e))
    +        } else {
    +          allInputs += e
    +          dataTypes += e.dataType
    +          allInputs.length - 1
    +        }
    +      }.toArray
    +    }.toArray
    +
    +    val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    val input = groupingExpressions.map(_.toAttribute) ++ udfExpressions.map(_.resultAttribute)
    --- End diff --
    
    I have changed the variable naming to be more meaningful and added some comments.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160708666
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,12 +15,30 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
     
    -import org.apache.spark.api.python.PythonFunction
    -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression}
    +import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
    +import org.apache.spark.sql.catalyst.util.toPrettySQL
     import org.apache.spark.sql.types.DataType
     
    +/**
    + * Helper functions for PythonUDF
    + */
    +object PythonUDF {
    +  def isScalarPythonUDF(e: Expression): Boolean = {
    +    e.isInstanceOf[PythonUDF] &&
    +      Set(
    --- End diff --
    
    Aha, good call.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85604 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85604/testReport)** for PR 19872 at commit [`46b111c`](https://github.com/apache/spark/commit/46b111c78105a7a0cc8fae47d064514ae496ca3b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r156037157
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,9 +48,26 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case _ @ PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF ) => true
    --- End diff --
    
    We don't need `_ @` here.
    nit: remove the extra space after `SQL_PANDAS_GROUP_AGG_UDF`.
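
    A sketch of the cleaned-up match (same logic; the fall-through case for
    non-matching expressions is assumed from the surrounding code):

        private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
          // Match on the eval type directly; no `_ @` binder is needed.
          case PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF) => true
          case _ => false
        }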


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162532163
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4273,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    --- End diff --
    
    oh sorry, I misread the code. I thought it was testing `pandas_agg_mean_udf`; in that case this is totally fine, and we don't need the manual test.


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161856927
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -334,34 +339,51 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       object Aggregation extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
           case PhysicalAggregation(
    -          groupingExpressions, aggregateExpressions, resultExpressions, child) =>
    -
    -        val (functionsWithDistinct, functionsWithoutDistinct) =
    -          aggregateExpressions.partition(_.isDistinct)
    -        if (functionsWithDistinct.map(_.aggregateFunction.children).distinct.length > 1) {
    -          // This is a sanity check. We should not reach here when we have multiple distinct
    -          // column sets. Our MultipleDistinctRewriter should take care this case.
    -          sys.error("You hit a query analyzer bug. Please report your query to " +
    -              "Spark user mailing list.")
    -        }
    +          groupingExpressions, aggExpressions, resultExpressions, child) =>
    +
    +        if (aggExpressions.forall(expr => expr.isInstanceOf[AggregateExpression])) {
     
    -        val aggregateOperator =
    -          if (functionsWithDistinct.isEmpty) {
    -            aggregate.AggUtils.planAggregateWithoutDistinct(
    -              groupingExpressions,
    -              aggregateExpressions,
    -              resultExpressions,
    -              planLater(child))
    -          } else {
    -            aggregate.AggUtils.planAggregateWithOneDistinct(
    -              groupingExpressions,
    -              functionsWithDistinct,
    -              functionsWithoutDistinct,
    -              resultExpressions,
    -              planLater(child))
    +          val aggregateExpressions = aggExpressions.map(expr =>
    +            expr.asInstanceOf[AggregateExpression])
    +
    +          val (functionsWithDistinct, functionsWithoutDistinct) =
    +            aggregateExpressions.partition(_.isDistinct)
    +          if (functionsWithDistinct.map(_.aggregateFunction.children).distinct.length > 1) {
    +            // This is a sanity check. We should not reach here when we have multiple distinct
    +            // column sets. Our MultipleDistinctRewriter should take care this case.
    +            sys.error("You hit a query analyzer bug. Please report your query to " +
    +              "Spark user mailing list.")
               }
     
    -        aggregateOperator
    +          val aggregateOperator =
    +            if (functionsWithDistinct.isEmpty) {
    +              aggregate.AggUtils.planAggregateWithoutDistinct(
    +                groupingExpressions,
    +                aggregateExpressions,
    +                resultExpressions,
    +                planLater(child))
    +            } else {
    +              aggregate.AggUtils.planAggregateWithOneDistinct(
    +                groupingExpressions,
    +                functionsWithDistinct,
    +                functionsWithoutDistinct,
    +                resultExpressions,
    +                planLater(child))
    +            }
    +
    +          aggregateOperator
    +        } else if (aggExpressions.forall(expr => expr.isInstanceOf[PythonUDF])) {
    +          val udfExpressions = aggExpressions.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +          Seq(execution.python.AggregateInPandasExec(
    +            groupingExpressions,
    +            udfExpressions,
    +            resultExpressions,
    +            planLater(child)))
    +        } else {
    +          throw new IllegalArgumentException(
    --- End diff --
    
    +1. Let's double check in https://github.com/apache/spark/pull/19872#discussion_r161507315
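    
    For context, a sketch of the query shape that the error branch above guards against (assuming a `df` with `id`/`v` columns like the tests in this PR; the exact message is what's being discussed in the linked thread):
    
    ```python
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    
    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def sum_udf(v):
        return v.sum()
    
    df.groupby('id').agg(F.sum(df.v))    # all AggregateExpression -> SQL agg path
    df.groupby('id').agg(sum_udf(df.v))  # all PythonUDF -> AggregateInPandasExec
    # Mixing the two kinds in one agg() matches neither branch and falls through
    # to the error case above:
    # df.groupby('id').agg(F.sum(df.v), sum_udf(df.v))
    ```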


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85613 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85613/testReport)** for PR 19872 at commit [`8c39469`](https://github.com/apache/spark/commit/8c39469db55f8933212dcdbf1ca2564bbf0f5c7d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157896062
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -89,8 +89,15 @@ def agg(self, *exprs):
             else:
                 # Columns
                 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    -            jdf = self._jgd.agg(exprs[0]._jc,
    -                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    +            if isinstance(exprs[0], UDFColumn):
    +                assert all(isinstance(c, UDFColumn) for c in exprs)
    --- End diff --
    
    Answered in https://github.com/apache/spark/pull/19872#issuecomment-350139367


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85138/testReport)** for PR 19872 at commit [`62c8f00`](https://github.com/apache/spark/commit/62c8f00b84ca600ea47ebb6db99dc86890099e4b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157891477
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
               df.logicalPlan))
       }
     
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    })
    +
    +    val project = Project(groupingNamedExpressions ++ childrenExpressions, child)
    +
    +    val udfOutputs = exprs.flatMap(expr =>
    +      Seq(AttributeReference(expr.name, expr.dataType)())
    +    )
    --- End diff --
    
    Removed


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161689381
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar
    +       The returnType should be a primitive data type, e.g, `DoubleType()`.
    +       The returned scalar can be either a python primitive type, e.g., `int` or `float`
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    --- End diff --
    
    similar to the `GROUP_MAP` UDF, shall we also consider putting the grouping columns in the parameters of `GROUP_AGG` UDF?
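    
    Purely to illustrate the question, a hypothetical API shape (not what this PR implements) could pass the grouping column as an extra argument, mirroring how `GROUP_MAP` exposes the whole group:
    
    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    
    # Hypothetical: `key` would be the grouping column ('id'), passed along with `v`.
    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def mean_udf(key, v):
        return v.mean()
    ```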


---


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84628/testReport)** for PR 19872 at commit [`3352050`](https://github.com/apache/spark/commit/335205037470228fa615def5d1246231b546c467).
     * This patch **fails Python style tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154569177
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -113,6 +113,7 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
       def apply(plan: SparkPlan): SparkPlan = plan transformUp {
         // FlatMapGroupsInPandas can be evaluated directly in python worker
         // Therefore we don't need to extract the UDFs
    --- End diff --
    
    `FlatMapGroupsInPandas` and `AggregateInPandasExec` can be...


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162235824
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4273,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    --- End diff --
    
    `weighted_mean_udf` and `mean` are both to be tested, so we can't compare their results against each other...
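    
    (A manual check against a hand-computed value, for reference: with the `data` fixture above, `v` for `id == 0` is 20.0..29.0 and `w` is all 1.0, so the weighted mean is 24.5.)
    
    ```python
    result = (df.groupby('id')
              .agg(weighted_mean_udf(df.v, df.w).alias('wm'))
              .sort('id')
              .toPandas())
    # Spot-check one group against a hand-computed value instead of mean():
    assert abs(result.loc[0, 'wm'] - 24.5) < 1e-6
    ```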


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158911077
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -273,7 +274,7 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
     
             aggregate.AggUtils.planStreamingAggregation(
               namedGroupingExpressions,
    -          aggregateExpressions,
    +          aggregateExpressions.map(expr => expr.asInstanceOf[AggregateExpression]),
    --- End diff --
    
    Does it mean that pandas UDAFs don't support streaming aggregation?
    Maybe we should check that a streaming aggregation doesn't contain any pandas UDAF, and throw an exception with a reasonable message if it does.
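    
    For reference, a sketch of the user-facing check being proposed (hypothetical; the exact error type and message are not settled here):
    
    ```python
    sdf = spark.readStream.format('rate').load()  # any streaming source
    # A group aggregate pandas UDF inside a streaming aggregation would be
    # rejected up front with a clear message instead of failing at runtime:
    # sdf.groupBy('value').agg(mean_udf(sdf.value))  # -> AnalysisException (proposed)
    ```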


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161855060
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    --- End diff --
    
    Done.


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161855220
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result5 = (df.groupby(plus_one(df.id))
    +                   .agg(plus_one(sum_udf(plus_one(df.v))))
    +                   .sort('plus_one(id)'))
    +        expected5 = (df.groupby(plus_one(df.id))
    +                     .agg(plus_one(sum(plus_one(df.v))).alias('plus_one(sum_udf(plus_one(v)))'))
    +                     .sort('plus_one(id)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +
    +    def test_multiple(self):
    --- End diff --
    
    Done. 


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Hi all, I have addressed all the comments. Let me know if you have any further feedback. Thank you!


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    +1 for master-only. We can cherry-pick and backport later if we need to, even after this gets merged. As a reminder, we should complete the doc https://github.com/apache/spark/pull/19575 too.


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158952743
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -477,6 +502,7 @@ def test_udf_with_aggregate_function(self):
             sel = df.groupBy(my_copy(col("key")).alias("k"))\
                 .agg(sum(my_strlen(col("value"))).alias("s"))\
                 .select(my_add(col("k"), col("s")).alias("t"))
    +        self.printPlans(sel)
    --- End diff --
    
    Oh that's great. Thanks!


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86492/testReport)** for PR 19872 at commit [`cc659bc`](https://github.com/apache/spark/commit/cc659bc2487d81a9497bd032049c2c4272660716).


---


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85442/testReport)** for PR 19872 at commit [`99367a6`](https://github.com/apache/spark/commit/99367a6e0226a2e2dbd699b897c39d9ccc43e04b).


---


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85947/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157891261
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4016,6 +4016,89 @@ def test_unsupported_types(self):
                 with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
                     df.groupby('id').apply(f).collect()
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +    def assertFramesEqual(self, expected, result):
    +        msg = ("DataFrames are not equal: " +
    +               ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
    --- End diff --
    
    Fixed.


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162097637
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -333,16 +339,19 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
        */
       object Aggregation extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    -      case PhysicalAggregation(
    -          groupingExpressions, aggregateExpressions, resultExpressions, child) =>
    +      case PhysicalAggregation(groupingExpressions, aggExpressions, resultExpressions, child)
    +        if aggExpressions.forall(expr => expr.isInstanceOf[AggregateExpression]) =>
    +
    +        val aggregateExpressions = aggExpressions.map(expr =>
    +          expr.asInstanceOf[AggregateExpression])
     
             val (functionsWithDistinct, functionsWithoutDistinct) =
               aggregateExpressions.partition(_.isDistinct)
             if (functionsWithDistinct.map(_.aggregateFunction.children).distinct.length > 1) {
               // This is a sanity check. We should not reach here when we have multiple distinct
               // column sets. Our MultipleDistinctRewriter should take care this case.
               sys.error("You hit a query analyzer bug. Please report your query to " +
    -              "Spark user mailing list.")
    +            "Spark user mailing list.")
    --- End diff --
    
    No, you are right, I shouldn't have changed the style. Reverted :)


---


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159108771
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,140 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions._
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[NamedExpression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override val output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        allInputs += e
    --- End diff --
    
    Fixed.


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165387302
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
     object PhysicalAggregation {
       // groupingExpressions, aggregateExpressions, resultExpressions, child
       type ReturnType =
    -    (Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan)
    +    (Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan)
    --- End diff --
    
    @yhuai Yeah I can certainly try it out. Created https://issues.apache.org/jira/browse/SPARK-23302 to track.
    
    I assume this is not urgent?


---


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162402605
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -65,7 +65,16 @@ def __init__(self, jgd, df):
         def agg(self, *exprs):
             """Compute aggregates and returns the result as a :class:`DataFrame`.
     
    -        The available aggregate functions are `avg`, `max`, `min`, `sum`, `count`.
    +        The available aggregate functions can be:
    +
    +        1. built-in aggregation functions, such as `avg`, `max`, `min`, `sum`, `count`
    +
    +        2. group aggregate pandas UDFs
    +
    +           .. note:: There is no partial aggregation with group aggregate UDFs, i.e.,
    +               a full shuffle is required.
    +
    +           .. seealso:: :meth:`pyspark.sql.functions.pandas_udf`
    --- End diff --
    
    Added
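    
    For reference, minimal usage matching the doc text above (a sketch; assumes pandas and Arrow are installed):
    
    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    
    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def mean_udf(v):
        return v.mean()
    
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ('id', 'v'))
    # No partial aggregation: each group is shuffled to a single worker and the
    # UDF sees the whole group's values as one pandas.Series.
    df.groupby('id').agg(mean_udf(df.v)).show()
    ```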


---


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84632/testReport)** for PR 19872 at commit [`37eff29`](https://github.com/apache/spark/commit/37eff294bc3825763fc438bfc4c291cbacfb0a0f).


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161500723
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    --- End diff --
    
    Hm, what does this test target?


---


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    And to @holdenk's question: pandas group_agg UDFs fundamentally use a different physical plan than the existing Java/Scala UDAFs, so it's hard to combine them. I don't know a good way to do this; the closest is maybe to compute the Java/Scala and Python aggregations separately and join the results together.
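    
    (For concreteness, a sketch of that workaround; `my_pandas_agg_udf` is a placeholder for any GROUP_AGG pandas UDF:)
    
    ```python
    from pyspark.sql import functions as F
    
    # Run the JVM-side and Python-side aggregations separately...
    jvm_agg = df.groupby('id').agg(F.sum(df.v).alias('sum_v'))
    py_agg = df.groupby('id').agg(my_pandas_agg_udf(df.v).alias('agg_v'))
    # ...then join the results back together on the grouping key.
    combined = jvm_agg.join(py_agg, on='id')
    ```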


---


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162047047
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    --- End diff --
    
    Shall we add a prefix here too?


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161496153
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    --- End diff --
    
    Let's add a comment for each test here. It seems hard to read as is.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    @ueshin and @HyukjinKwon Thanks much for the review. I have addressed the latest comment.
    
    @ueshin I think a UDAF that supports partial aggregation can be built on top of this. The questions you asked about UDAF interfaces are very good; I haven't thought them through. It will probably end up being another API for defining pandas UDFs (with `update`, `merge`, and `finalize` methods, for instance) and another physical plan. I think we can leave that for the future. Is there any specific issue with regard to partial-aggregation UDAFs that you want to address here?
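
    To make that concrete, a rough sketch of what such an interface might look like (purely hypothetical; none of these names exist today):

        # Hypothetical partial-aggregation pandas UDAF interface (sketch only).
        class PandasMeanUDAF(object):
            def zero(self):
                # Initial aggregation buffer: (running sum, running count).
                return (0.0, 0)

            def update(self, buf, v):
                # v is a pandas.Series holding one chunk of the input column.
                return (buf[0] + v.sum(), buf[1] + len(v))

            def merge(self, buf1, buf2):
                # Combine partial buffers coming from different partitions.
                return (buf1[0] + buf2[0], buf1[1] + buf2[1])

            def finalize(self, buf):
                # Produce the final scalar for the group.
                return buf[0] / buf[1] if buf[1] else None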


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165385572
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_manual(self):
    +        df = self.data
    +        sum_udf = self.pandas_agg_sum_udf
    +        mean_udf = self.pandas_agg_mean_udf
    +
    +        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        expected1 = self.spark.createDataFrame(
    +            [[0, 245.0, 24.5],
    +             [1, 255.0, 25.5],
    +             [2, 265.0, 26.5],
    +             [3, 275.0, 27.5],
    +             [4, 285.0, 28.5],
    +             [5, 295.0, 29.5],
    +             [6, 305.0, 30.5],
    +             [7, 315.0, 31.5],
    +             [8, 325.0, 32.5],
    +             [9, 335.0, 33.5]],
    +            ['id', 'sum(v)', 'avg(v)'])
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF with literal
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        # Groupby one column and aggregate one UDF without literal
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF without literal
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_unsupported_types(self):
    +        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegex(NotImplementedError, 'not supported'):
    --- End diff --
    
    @ueshin Thanks for fixing this. (I am late to the party)


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160708246
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -511,7 +517,6 @@ def test_udf_with_order_by_and_limit(self):
             my_copy = udf(lambda x: x, IntegerType())
             df = self.spark.range(10).orderBy("id")
             res = df.select(df.id, my_copy(df.id).alias("copy")).limit(1)
    -        res.explain(True)
    --- End diff --
    
    Yes, I think this is removed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157823488
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -32,7 +31,5 @@ case class PythonUDF(
         evalType: Int)
       extends Expression with Unevaluable with NonSQLExpression with UserDefinedExpression {
     
    -  override def toString: String = s"$name(${children.mkString(", ")})"
    --- End diff --
    
    Whoops, my bad, adding it back.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85152/
    Test FAILed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158301996
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,29 +48,46 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF) => true
    +      case Alias(child, _) => isPandasGroupAggUdf(child)
    +      case _ => false
    +  }
    +
    +  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
    +    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)
    +    actualAggExpr.exists(isPandasGroupAggUdf)
    +  }
    +
    +
    --- End diff --
    
    Removed


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162235628
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -363,6 +371,21 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
     
             aggregateOperator
     
    +      case PhysicalAggregation(groupingExpressions, aggExpressions, resultExpressions, child)
    +        if aggExpressions.forall(expr => expr.isInstanceOf[PythonUDF]) =>
    +        val udfExpressions = aggExpressions.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +        Seq(execution.python.AggregateInPandasExec(
    +          groupingExpressions,
    +          udfExpressions,
    +          resultExpressions,
    +          planLater(child)))
    +
    +      case PhysicalAggregation(groupingExpressions, aggExpressions, resultExpressions, child) =>
    --- End diff --
    
    How about `case PhysicalAggregation(_, _, _, _) =>`?


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161854767
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    --- End diff --
    
    Changed to `GroupbyAggPandasUDFTests`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644084
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
         return udf_obj._wrapped()
     
     
    +class UDFColumn(Column):
    --- End diff --
    
    BTW, what do you think about adding an attribute instead?
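
    Something along these lines, for instance (sketch only; `_is_group_agg` is a made-up attribute name):

        from pyspark.sql.column import Column

        def _wrap_group_agg(jc):
            # Return a plain Column tagged with a marker attribute,
            # instead of returning a UDFColumn subclass.
            col = Column(jc)
            col._is_group_agg = True
            return col

        def _is_group_agg_col(c):
            return getattr(c, '_is_group_agg', False)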


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161701816
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar
    +       The returnType should be a primitive data type, e.g, `DoubleType()`.
    +       The returned scalar can be either a python primitive type, e.g., `int` or `float`
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    --- End diff --
    
    IIUC, this is similar to SQL aggregation, and this aggregate UDF should only take the columns being aggregated, not the grouping columns.
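
    For example (a sketch reusing `mean_udf` from the quoted docstring), the grouping column `id` is handled by `groupby()` and the UDF only ever sees `v`:

        df.groupby('id').agg(mean_udf(df.v)).show()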


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19872


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85138/testReport)** for PR 19872 at commit [`62c8f00`](https://github.com/apache/spark/commit/62c8f00b84ca600ea47ebb6db99dc86890099e4b).


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85932/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162235197
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -82,6 +91,13 @@ def agg(self, *exprs):
             >>> from pyspark.sql import functions as F
             >>> sorted(gdf.agg(F.min(df.age)).collect())
             [Row(name=u'Alice', min(age)=2), Row(name=u'Bob', min(age)=5)]
    +
    +        >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +        >>> @pandas_udf('int', PandasUDFType.GROUP_AGG)
    +        ... def min_udf(v):
    +        ...     return v.min()
    +        >>> sorted(gdf.agg(min_udf(df.age)).collect())  # doctest: +SKIP
    --- End diff --
    
    I think in the future we should make pandas/arrow a requirement of pyspark, so that we can always assume pandas/arrow is installed when running doctests.


---



[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85531/testReport)** for PR 19872 at commit [`ca02ad4`](https://github.com/apache/spark/commit/ca02ad41f8d5f0c87a2fdd856d8f41b2d066bc64).


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85476/testReport)** for PR 19872 at commit [`2800344`](https://github.com/apache/spark/commit/28003442b6c7605363fef56ae40c294dd680d15f).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154782452
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -89,8 +89,15 @@ def agg(self, *exprs):
             else:
                 # Columns
                 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    -            jdf = self._jgd.agg(exprs[0]._jc,
    -                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    +            if isinstance(exprs[0], UDFColumn):
    +                assert all(isinstance(c, UDFColumn) for c in exprs)
    --- End diff --
    
    So I'm a little worried about this change: if other folks have wrapped Java UDAFs (which is reasonable, since there weren't other ways to make UDAFs in PySpark before this), it seems like they won't be able to mix them. I'd suggest maybe doing what @viirya suggested below, but with just a warning instead of a failure until Spark 3.
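
    For reference, the wrapping people do today looks roughly like this (a sketch; `com.example.MyUDAF` is a placeholder JVM class):

        from pyspark import SparkContext
        from pyspark.sql.column import Column, _to_java_column, _to_seq

        def my_java_udaf(col):
            # Instantiate the JVM UDAF through py4j and apply it to the column.
            sc = SparkContext._active_spark_context
            agg = sc._jvm.com.example.MyUDAF()
            return Column(agg.apply(_to_seq(sc, [col], _to_java_column)))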
    
    What do y'all think?


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85932 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85932/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86189/testReport)** for PR 19872 at commit [`9824bbd`](https://github.com/apache/spark/commit/9824bbd6ca7c85cd493e5e7eef0db15bbaf1ad95).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162097452
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    --- End diff --
    
    Added


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84632/
    Test FAILed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84630/
    Test FAILed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Hey @ueshin, I will check and help double-check anyway, but is this close to what you had in mind, BTW? I saw you made many prototypes and wonder if this one sounds generally okay to you.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86492/testReport)** for PR 19872 at commit [`cc659bc`](https://github.com/apache/spark/commit/cc659bc2487d81a9497bd032049c2c4272660716).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165253514
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
     object PhysicalAggregation {
       // groupingExpressions, aggregateExpressions, resultExpressions, child
       type ReturnType =
    -    (Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan)
    +    (Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan)
    --- End diff --
    
    I'd prefer that we try using a new rule. We can create a utility function to reuse code. Will you have a chance to try it out?


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86345/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161708646
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar
    +       The returnType should be a primitive data type, e.g, `DoubleType()`.
    +       The returned scalar can be either a python primitive type, e.g., `int` or `float`
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    --- End diff --
    
    ah makes sense


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157891697
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,9 +48,26 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case _ @ PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF ) => true
    --- End diff --
    
    Fixed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157896109
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -113,6 +113,7 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
       def apply(plan: SparkPlan): SparkPlan = plan transformUp {
         // FlatMapGroupsInPandas can be evaluated directly in python worker
         // Therefore we don't need to extract the UDFs
    --- End diff --
    
    Added


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Addressed the latest comments. Yeah, I think master only is fine.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/109/
    Test PASSed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86279/testReport)** for PR 19872 at commit [`8d2ad6a`](https://github.com/apache/spark/commit/8d2ad6ab8668626627b68d059f832a7974e1c971).


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Ping @ueshin, what do you think of the current state of the PR?


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154569953
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -89,8 +89,15 @@ def agg(self, *exprs):
             else:
                 # Columns
                 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    -            jdf = self._jgd.agg(exprs[0]._jc,
    -                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    +            if isinstance(exprs[0], UDFColumn):
    +                assert all(isinstance(c, UDFColumn) for c in exprs)
    --- End diff --
    
    Like `all exprs should be UDFColumn`.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154615728
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
         return udf_obj._wrapped()
     
     
    +class UDFColumn(Column):
    --- End diff --
    
    Why did we add this new sub-class?


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161499430
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result5 = (df.groupby(plus_one(df.id))
    +                   .agg(plus_one(sum_udf(plus_one(df.v))))
    +                   .sort('plus_one(id)'))
    +        expected5 = (df.groupby(plus_one(df.id))
    +                     .agg(plus_one(sum(plus_one(df.v))).alias('plus_one(sum_udf(plus_one(v)))'))
    +                     .sort('plus_one(id)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +
    +    def test_multiple(self):
    --- End diff --
    
    `test_multiple_udfs` .. ?


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84414/testReport)** for PR 19872 at commit [`4cfaf0e`](https://github.com/apache/spark/commit/4cfaf0e9723bcfbb74dfd1b9d1f5e30682bf072f).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161856598
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -288,9 +289,13 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
           case PhysicalAggregation(
             namedGroupingExpressions, aggregateExpressions, rewrittenResultExpressions, child) =>
     
    +        require(
    --- End diff --
    
    Yeah, I was thinking about this one too. I prefer `AnalysisException` as well, but I want to make sure. @cloud-fan, what do you think?


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86279/testReport)** for PR 19872 at commit [`8d2ad6a`](https://github.com/apache/spark/commit/8d2ad6ab8668626627b68d059f832a7974e1c971).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86280/
    Test PASSed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161855488
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result5 = (df.groupby(plus_one(df.id))
    +                   .agg(plus_one(sum_udf(plus_one(df.v))))
    +                   .sort('plus_one(id)'))
    +        expected5 = (df.groupby(plus_one(df.id))
    +                     .agg(plus_one(sum(plus_one(df.v))).alias('plus_one(sum_udf(plus_one(v)))'))
    +                     .sort('plus_one(id)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +
    +    def test_multiple(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +        sum_udf = self.sum_udf
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = (df.groupBy('id')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.v),
    +                        weighted_mean_udf(df.v, df.w))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.groupBy('id')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.v).alias('sum_udf(v)'),
    +                          mean(df.v).alias('weighted_mean_udf(v, w)'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.groupBy('id', 'v')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.id))
    +                   .sort('id', 'v')
    +                   .toPandas())
    +
    +        expected2 = (df.groupBy('id', 'v')
    +                     .agg(mean_udf(df.v).alias('mean_udf(v)'),
    +                          sum_udf(df.id).alias('sum_udf(id)'))
    +                     .sort('id', 'v')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +
    +    def test_complex_grouping(self):
    --- End diff --
    
    Changed to `test_complex_groupby`.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/106/
    Test PASSed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    > I thought @ueshin is working on this BTW.
    
    Oh, I certainly don't want to duplicate @ueshin's work. I am under the impression that @ueshin is working on a two-stage PySpark UDAF with pandas_udf, but I cannot find the JIRA for it...
    
    @ueshin, can you point me to what you are working on so I don't overstep?



---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86279/
    Test PASSed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157939292
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,29 +48,46 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF) => true
    +      case Alias(child, _) => isPandasGroupAggUdf(child)
    +      case _ => false
    +  }
    +
    +  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
    +    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)
    --- End diff --
    
    Do we need to drop the grouping expressions?
    If we do, we should drop them only when `conf.dataFrameRetainGroupColumns == true`; otherwise `aggregateExpressions` doesn't contain `groupingExpressions` in the first place.
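    
    For reference, `spark.sql.retainGroupColumns` controls whether the grouping columns appear in the `agg` output at all, which is what decides whether there is anything to drop. A quick PySpark illustration (a hypothetical snippet, not part of this PR):
    
    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ['id', 'v'])
    
    # Default: grouping columns are retained, so the output is (id, sum(v)).
    spark.conf.set('spark.sql.retainGroupColumns', 'true')
    df.groupby('id').agg(F.sum(df.v)).show()
    
    # With the conf disabled, the output contains only sum(v).
    spark.conf.set('spark.sql.retainGroupColumns', 'false')
    df.groupby('id').agg(F.sum(df.v)).show()
    ```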


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85604/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161854698
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    --- End diff --
    
    This is to test mixing group aggregate pandas UDFs with SQL expressions. I also added some comments.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85137/testReport)** for PR 19872 at commit [`1a197b7`](https://github.com/apache/spark/commit/1a197b760beef191615020cfec6fdaaa5c465fdb).


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84632/testReport)** for PR 19872 at commit [`37eff29`](https://github.com/apache/spark/commit/37eff294bc3825763fc438bfc4c291cbacfb0a0f).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86187/testReport)** for PR 19872 at commit [`9fbf012`](https://github.com/apache/spark/commit/9fbf01275159fb7b16cf11687510746d174a7e1f).


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84628/
    Test FAILed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85476/testReport)** for PR 19872 at commit [`2800344`](https://github.com/apache/spark/commit/28003442b6c7605363fef56ae40c294dd680d15f).


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644235
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
               df.logicalPlan))
       }
     
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    --- End diff --
    
    indentation nit


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85136 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85136/testReport)** for PR 19872 at commit [`ab91314`](https://github.com/apache/spark/commit/ab91314e8f89162f75493802a7f1fbd1e319d8ec).


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158872704
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,29 +48,46 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF) => true
    +      case Alias(child, _) => isPandasGroupAggUdf(child)
    +      case _ => false
    +  }
    +
    +  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
    +    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)
    --- End diff --
    
    This is fixed. I added a `test_retain_grouping_columns` test.
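    
    Roughly, the test looks like this (sketched from memory, so the committed version may differ slightly):
    
    ```python
    def test_retain_grouping_columns(self):
        from pyspark.sql import functions as F
        df = self.data
        sum_udf = self.sum_udf
        orig = self.spark.conf.get('spark.sql.retainGroupColumns', 'true')
        try:
            # The pandas UDF path should honor the conf just like built-ins do.
            self.spark.conf.set('spark.sql.retainGroupColumns', 'false')
            result = df.groupby(df.id).agg(sum_udf(df.v).alias('s'))
            expected = df.groupby(df.id).agg(F.sum(df.v).alias('s'))
            self.assertPandasEqual(expected.toPandas(), result.toPandas())
        finally:
            self.spark.conf.set('spark.sql.retainGroupColumns', orig)
    ```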


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162047640
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -39,18 +38,20 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
        */
       private def belongAggregate(e: Expression, agg: Aggregate): Boolean = {
         e.isInstanceOf[AggregateExpression] ||
    +      PythonUDF.isGroupAggPandasUDF(e) ||
           agg.groupingExpressions.exists(_.semanticEquals(e))
       }
     
       private def hasPythonUdfOverAggregate(expr: Expression, agg: Aggregate): Boolean = {
         expr.find {
    -      e => e.isInstanceOf[PythonUDF] && e.find(belongAggregate(_, agg)).isDefined
    +      e => PythonUDF.isScalarPythonUDF(e) && e.find(belongAggregate(_, agg)).isDefined
         }.isDefined
       }
     
       private def extract(agg: Aggregate): LogicalPlan = {
         val projList = new ArrayBuffer[NamedExpression]()
         val aggExpr = new ArrayBuffer[NamedExpression]()
    +
    --- End diff --
    
    ditto.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86487/testReport)** for PR 19872 at commit [`91885e5`](https://github.com/apache/spark/commit/91885e5dbca02daf30f4d7c8dd1560c8b1bbad47).


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162367851
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4273,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    --- End diff --
    
    Here `mean` is the Spark built-in aggregation function, so it's not the test target here. This is a behavior test that checks "an aggregate UDF should give the same result as the equivalent built-in function".
    
    I generally prefer behavior tests over manual tests because they are more robust and we can write a lot of them. Maybe we can add one manually computed result to be safe; what do you think?
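    
    For instance, a manual check would hard-code the expected numbers instead of calling a built-in (a hypothetical sketch; `pandas_agg_mean_udf` is the fixture defined above):
    
    ```python
    def test_manual(self):
        df = self.spark.createDataFrame(
            [(1, 1.0), (1, 2.0), (2, 3.0)], ['id', 'v'])
        rows = (df.groupby('id')
                  .agg(self.pandas_agg_mean_udf(df.v).alias('m'))
                  .sort('id')
                  .collect())
        # Means computed by hand: group 1 -> 1.5, group 2 -> 3.0.
        self.assertEqual([r['m'] for r in rows], [1.5, 3.0])
    ```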


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85152/testReport)** for PR 19872 at commit [`ea5d6f3`](https://github.com/apache/spark/commit/ea5d6f319aa3b1bba20ad86a51e6efb65658e3d2).


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159108773
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -273,7 +274,7 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
     
             aggregate.AggUtils.planStreamingAggregation(
               namedGroupingExpressions,
    -          aggregateExpressions,
    +          aggregateExpressions.map(expr => expr.asInstanceOf[AggregateExpression]),
    --- End diff --
    
    Added check.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157931824
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,143 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, NamedExpression, PythonUDF, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[Expression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override def output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    allInputs.appendAll(groupingExpressions)
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +          allInputs += e
    +          dataTypes += e.dataType
    +          allInputs.length - 1 - groupingExpressions.length
    +      }.toArray
    +    }.toArray
    +
    +    val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    inputRDD.mapPartitionsInternal { iter =>
    +      val grouped = if (groupingExpressions.isEmpty) {
    +        Iterator((null, iter))
    +      } else {
    +        val groupedIter = GroupedIterator(iter, groupingExpressions, child.output)
    +
    +        val dropGrouping =
    +          UnsafeProjection.create(allInputs.drop(groupingExpressions.length), child.output)
    +
    +        groupedIter.map {
    +          case (k, groupedRowIter) => (k, groupedRowIter.map(dropGrouping))
    +        }
    +      }
    +
    +      val context = TaskContext.get()
    +
    +      // The queue used to buffer input rows so we can drain it to
    +      // combine input with output from Python.
    +      val queue = HybridRowQueue(context.taskMemoryManager(),
    +        new File(Utils.getLocalDir(SparkEnv.get.conf)), groupingExpressions.length)
    +      context.addTaskCompletionListener { _ =>
    +        queue.close()
    +      }
    +
    +      // Add rows to queue to join later with the result.
    +      val projectedRowIter = grouped.map { case (groupingKey, rows) =>
    +        queue.add(groupingKey.asInstanceOf[UnsafeRow])
    +        rows
    +      }
    +
    +      val columnarBatchIter = new ArrowPythonRunner(
    +        pyFuncs, bufferSize, reuseWorker,
    +        PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF, argOffsets, schema,
    +        sessionLocalTimeZone, pandasRespectSessionTimeZone)
    +        .compute(projectedRowIter, context.partitionId(), context)
    +
    +      val joined = new JoinedRow
    +      val resultProj = UnsafeProjection.create(output, output)
    +
    +      columnarBatchIter.map(_.rowIterator.next()).map{ outputRow =>
    --- End diff --
    
    Sorry, I meant `columnarBatchIter.flatMap(_.rowIterator.asScala)`. I'd prefer this one.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85605 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85605/testReport)** for PR 19872 at commit [`5a4fc58`](https://github.com/apache/spark/commit/5a4fc58c579ed432b00a80eed69abec8ea9e8b21).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158952133
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -273,7 +274,7 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
     
             aggregate.AggUtils.planStreamingAggregation(
               namedGroupingExpressions,
    -          aggregateExpressions,
    +          aggregateExpressions.map(expr => expr.asInstanceOf[AggregateExpression]),
    --- End diff --
    
    Yeah, streaming aggregation won't work. Good point. Let me add the check.
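    
    For clarity, the check should make a query like this fail analysis with a clear error (a hypothetical example; `mean_udf` is a GROUP_AGG pandas UDF as in the tests):
    
    ```python
    # Group aggregate pandas UDFs cannot be planned as a two-phase streaming
    # aggregation, so this should be rejected up front rather than blowing up
    # with a ClassCastException at planning time.
    sdf = spark.readStream.format('rate').load()
    query = (sdf.groupBy('value')
                .agg(mean_udf(sdf['value']))
                .writeStream
                .outputMode('complete')
                .format('console')
                .start())
    ```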


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85928/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86189/
    Test FAILed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161856742
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -334,34 +339,51 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       object Aggregation extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
           case PhysicalAggregation(
    --- End diff --
    
    Done. I changed it to use case matching.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86196/testReport)** for PR 19872 at commit [`46db380`](https://github.com/apache/spark/commit/46db3802561088d1683681ea24f457c39226bdc5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160617597
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,152 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions._
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Physical node for aggregation with group aggregate Pandas UDF.
    + *
    + * This plan works by sending the necessary (projected) input grouped data as Arrow record batches
    + * to the python worker, the python worker invokes the UDF and sends the results to the executor,
    + * finally the executor evaluates any post-aggregation expressions and join the result with the
    + * grouped key.
    + */
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[NamedExpression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override val output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        if (allInputs.exists(_.semanticEquals(e))) {
    +          allInputs.indexWhere(_.semanticEquals(e))
    +        } else {
    +          allInputs += e
    +          dataTypes += e.dataType
    +          allInputs.length - 1
    +        }
    +      }.toArray
    +    }.toArray
    +
    +    val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    val input = groupingExpressions.map(_.toAttribute) ++ udfExpressions.map(_.resultAttribute)
    --- End diff --
    
    nit: maybe this name `input` is confusing.


---



[GitHub] spark pull request #19872: [SPARK-22274][PySpark] User-defined aggregation f...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159248298
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -82,6 +91,13 @@ def agg(self, *exprs):
             >>> from pyspark.sql import functions as F
             >>> sorted(gdf.agg(F.min(df.age)).collect())
             [Row(name=u'Alice', min(age)=2), Row(name=u'Bob', min(age)=5)]
    +
    +        >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +        >>> @pandas_udf('int', PandasUDFType.GROUP_AGG)
    +        ... def min_udf(v):
    +        ...     return v.min()
    +        >>> sorted(gdf.agg(min_udf(df.age)).collect())  # doctest: +SKIP
    --- End diff --
    
    I don't know a good way of skipping a doctest when pyarrow is not available... If others have ideas, please let me know.
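    
    One half-baked option might be to rewrite the docstring in `_test()` before doctest collects it (a hypothetical sketch; going through the class `__dict__` keeps the raw function's `__doc__` writable on both Python 2 and 3, and the usual globs setup is omitted for brevity):
    
    ```python
    def _test():
        import doctest
        import pyspark.sql.group
    
        try:
            import pyarrow  # noqa: F401
        except ImportError:
            # No pyarrow: mark the pandas_udf example as skipped.
            agg_fn = pyspark.sql.group.GroupedData.__dict__['agg']
            agg_fn.__doc__ = agg_fn.__doc__.replace(
                '>>> sorted(gdf.agg(min_udf(df.age)).collect())',
                '>>> sorted(gdf.agg(min_udf(df.age)).collect())'
                '  # doctest: +SKIP')
    
        (failure_count, _) = doctest.testmod(
            pyspark.sql.group, optionflags=doctest.ELLIPSIS)
        if failure_count:
            exit(-1)
    ```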


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162373838
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -363,6 +371,21 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
     
             aggregateOperator
     
    +      case PhysicalAggregation(groupingExpressions, aggExpressions, resultExpressions, child)
    +        if aggExpressions.forall(expr => expr.isInstanceOf[PythonUDF]) =>
    +        val udfExpressions = aggExpressions.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +        Seq(execution.python.AggregateInPandasExec(
    +          groupingExpressions,
    +          udfExpressions,
    +          resultExpressions,
    +          planLater(child)))
    +
    +      case PhysicalAggregation(groupingExpressions, aggExpressions, resultExpressions, child) =>
    --- End diff --
    
    Aha, good point.
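    
    For reference, the shape that falls through both branches is a query mixing a group aggregate pandas UDF with a built-in aggregate (a hypothetical example):
    
    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import mean, pandas_udf, PandasUDFType
    
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ['id', 'v'])
    
    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def agg_sum(v):
        return v.sum()
    
    # Neither the all-PythonUDF branch nor the plain aggregation branch
    # matches this: it needs its own handling or a clear error message.
    df.groupby('id').agg(agg_sum(df.v), mean(df.v))
    ```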


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86345/testReport)** for PR 19872 at commit [`17fad5c`](https://github.com/apache/spark/commit/17fad5c0f83edb142471e2c4a1ffad08d7a29c5d).


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161505033
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,12 +15,31 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
     
    -import org.apache.spark.api.python.PythonFunction
    -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression}
    +import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
    +import org.apache.spark.sql.catalyst.util.toPrettySQL
     import org.apache.spark.sql.types.DataType
     
    +/**
    + * Helper functions for PythonUDF
    --- End diff --
    
    I can't believe I am nitpicking this: `PythonUDF` -> `` `PythonUDF` ``.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/5/
    Test PASSed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86187/testReport)** for PR 19872 at commit [`9fbf012`](https://github.com/apache/spark/commit/9fbf01275159fb7b16cf11687510746d174a7e1f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161855128
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    --- End diff --
    
    Yes sorry! I added comments for each test case.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162886239
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2221,6 +2223,35 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar
    +       The `returnType` should be a primitive data type, e.g, :class:`DoubleType`.
    --- End diff --
    
    very small nit: `e.g.` instead of `e.g`.
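    
    As a minimal illustration of the `GROUP_AGG` shape this docstring describes (the UDF and DataFrame names below are illustrative, not from the patch):
    
    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # One or more pandas.Series in, one scalar out; returnType is primitive.
    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def max_udf(v):
        return v.max()

    # df.groupby('id').agg(max_udf(df.v)) then yields one aggregated row per group.
    ```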


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165449847
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
     object PhysicalAggregation {
       // groupingExpressions, aggregateExpressions, resultExpressions, child
       type ReturnType =
    -    (Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan)
    +    (Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan)
    --- End diff --
    
    It will be good to try it out soon. But it is not urgent.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162047620
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -27,7 +27,6 @@ import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Proj
     import org.apache.spark.sql.catalyst.rules.Rule
     import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}
     
    -
    --- End diff --
    
    Here too, let's keep the changes related to what this PR proposes. The PR is already quite big.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85605/testReport)** for PR 19872 at commit [`5a4fc58`](https://github.com/apache/spark/commit/5a4fc58c579ed432b00a80eed69abec8ea9e8b21).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Only a few nits. LGTM, but let me leave it to @ueshin.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165224852
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
     object PhysicalAggregation {
       // groupingExpressions, aggregateExpressions, resultExpressions, child
       type ReturnType =
    -    (Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan)
    +    (Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan)
    --- End diff --
    
    Hi @yhuai,
    
    You bring up a good point. I agree that ideally we should avoid doing this. When I was making the change, I found that the implemented solution results in the least amount of duplicate code, because a lot of logic is shared between AggregateExpression and Python UDF, but the downside is exactly what you mentioned.
    
    One alternative is to create new rules for Python UDAF, but my concern is that this could result in quite a bit of code duplication. Maybe there is a way to avoid the duplication and keep the type safety (for example, a common parent class for AggregateExpression and Python UDAF)? I am happy to explore that option.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Thanks all for review!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86280/testReport)** for PR 19872 at commit [`6fa2a8c`](https://github.com/apache/spark/commit/6fa2a8c24d777ebb1fb484e0a827eb0a453b23d6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84630/testReport)** for PR 19872 at commit [`184b37f`](https://github.com/apache/spark/commit/184b37f49817488f8cc60f2c392c5ad746d23927).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157944969
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,143 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, NamedExpression, PythonUDF, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[Expression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override def output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    allInputs.appendAll(groupingExpressions)
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        allInputs += e
    +        dataTypes += e.dataType
    +        allInputs.length - 1 - groupingExpressions.length
    +      }.toArray
    +    }.toArray
    +
    +    val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    inputRDD.mapPartitionsInternal { iter =>
    +      val grouped = if (groupingExpressions.isEmpty) {
    +        Iterator((null, iter))
    +      } else {
    +        val groupedIter = GroupedIterator(iter, groupingExpressions, child.output)
    +
    +        val dropGrouping =
    +          UnsafeProjection.create(allInputs.drop(groupingExpressions.length), child.output)
    +
    +        groupedIter.map {
    +          case (k, groupedRowIter) => (k, groupedRowIter.map(dropGrouping))
    +        }
    +      }
    +
    +      val context = TaskContext.get()
    +
    +      // The queue used to buffer input rows so we can drain it to
    +      // combine input with output from Python.
    +      val queue = HybridRowQueue(context.taskMemoryManager(),
    +        new File(Utils.getLocalDir(SparkEnv.get.conf)), groupingExpressions.length)
    +      context.addTaskCompletionListener { _ =>
    +        queue.close()
    +      }
    +
    +      // Add rows to queue to join later with the result.
    +      val projectedRowIter = grouped.map { case (groupingKey, rows) =>
    +        queue.add(groupingKey.asInstanceOf[UnsafeRow])
    +        rows
    +      }
    +
    +      val columnarBatchIter = new ArrowPythonRunner(
    +        pyFuncs, bufferSize, reuseWorker,
    +        PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF, argOffsets, schema,
    +        sessionLocalTimeZone, pandasRespectSessionTimeZone)
    +        .compute(projectedRowIter, context.partitionId(), context)
    +
    +      val joined = new JoinedRow
    +      val resultProj = UnsafeProjection.create(output, output)
    --- End diff --
    
    We need to handle `resultExpressions` for cases like the following, where the UDF result is only a sub-expression of the aggregate output (here `mean_udf(...) + 1`), so projecting `output` onto itself is not enough:
    
    ```python
        def test_result_expressions(self):
            import numpy as np
            from pyspark.sql.functions import lit, mean, pandas_udf, PandasUDFType
    
            df = self.data
    
            @pandas_udf('double', PandasUDFType.GROUP_AGG)
            def mean_udf(v, w):
                return np.average(v, weights=w)
    
            result1 = (df.groupby('id')
                       .agg(mean_udf(df.v, lit(1.0)) + 1)
                       .sort('id')
                       .toPandas())
    
            expected1 = (df.groupby('id')
                         .agg(mean(df.v) + 1)
                         .sort('id')
                         .toPandas())
    
            self.assertPandasEqual(expected1, result1)
    ```



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r159108798
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4052,6 +4066,323 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, float)
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        self.spark.conf.set("spark.sql.codegen.wholeStage", False)
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_multiple(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +        sum_udf = self.sum_udf
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = (df.groupBy('id')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.v),
    +                        weighted_mean_udf(df.v, df.w))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.groupBy('id')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.v).alias('sum_udf(v)'),
    +                          mean(df.v).alias('weighted_mean_udf(v, w)'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.groupBy('id', 'v')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.id))
    +                   .sort('id', 'v')
    +                   .toPandas())
    +
    +        expected2 = (df.groupBy('id', 'v')
    +                     .agg(mean_udf(df.v).alias('mean_udf(v)'),
    +                          sum_udf(df.id).alias('sum_udf(id)'))
    +                     .sort('id', 'v')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +
    +    def test_complex(self):
    +        from pyspark.sql.functions import col, sum
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.withColumn('v1', plus_one(df.v))
    +                   .withColumn('v2', df.v + 2)
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')),
    +                        sum_udf(col('v1') + 3),
    +                        sum_udf(col('v2')) + 5,
    +                        plus_one(sum_udf(col('v1'))),
    +                        sum_udf(plus_one(col('v2'))))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.withColumn('v1', df.v + 1)
    +                     .withColumn('v2', df.v + 2)
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_udf(v)'),
    +                          sum(col('v1') + 3).alias('sum_udf((v1 + 3))'),
    +                          (sum(col('v2')) + 5).alias('(sum_udf(v2) + 5)'),
    +                          plus_one(sum(col('v1'))).alias('plus_one(sum_udf(v1))'),
    +                          sum(col('v2') + 1).alias('sum_udf(plus_one(v2))'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.withColumn('v1', plus_one(df.v))
    +                   .withColumn('v2', df.v + 2)
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')),
    +                        sum_udf(col('v1') + 3),
    +                        sum_udf(col('v2')) + 5,
    +                        plus_two(sum_udf(col('v1'))),
    +                        sum_udf(plus_two(col('v2'))))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected2 = (df.withColumn('v1', df.v + 1)
    +                     .withColumn('v2', df.v + 2)
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_udf(v)'),
    +                          sum(col('v1') + 3).alias('sum_udf((v1 + 3))'),
    +                          (sum(col('v2')) + 5).alias('(sum_udf(v2) + 5)'),
    +                          plus_two(sum(col('v1'))).alias('plus_two(sum_udf(v1))'),
    +                          sum(col('v2') + 2).alias('sum_udf(plus_two(v2))'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(df.v).alias('v'))
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')).alias('sum_v'))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v).alias('v'))
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_v'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +        self.assertPandasEqual(expected3, result3)
    +
    --- End diff --
    
    Added `test_complex_grouping`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    @ueshin I think all comments are addressed. Can you take a final look? Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r160678737
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4052,6 +4045,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result5 = (df.groupby(plus_one(df.id))
    +                   .agg(plus_one(sum_udf(plus_one(df.v))))
    +                   .sort('plus_one(id)'))
    +        expected5 = (df.groupby(plus_one(df.id))
    +                     .agg(plus_one(sum(plus_one(df.v))).alias('plus_one(sum_udf(plus_one(v)))'))
    +                     .sort('plus_one(id)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +
    +    def test_multiple(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +        sum_udf = self.sum_udf
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = (df.groupBy('id')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.v),
    +                        weighted_mean_udf(df.v, df.w))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.groupBy('id')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.v).alias('sum_udf(v)'),
    +                          mean(df.v).alias('weighted_mean_udf(v, w)'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.groupBy('id', 'v')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.id))
    +                   .sort('id', 'v')
    +                   .toPandas())
    +
    +        expected2 = (df.groupBy('id', 'v')
    +                     .agg(mean_udf(df.v).alias('mean_udf(v)'),
    +                          sum_udf(df.id).alias('sum_udf(id)'))
    +                     .sort('id', 'v')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +
    +    def test_complex_grouping(self):
    +        from pyspark.sql.functions import lit, sum
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +
    +        result1 = df.groupby(df.id + 1).agg(sum_udf(df.v))
    +        expected1 = df.groupby(df.id + 1).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result2 = df.groupby().agg(sum_udf(df.v))
    +        expected2 = df.groupby().agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
    +        expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result4 = df.groupby(plus_one(df.id)).agg(sum_udf(df.v))
    +        expected4 = df.groupby(plus_one(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result5 = df.groupby(plus_two(df.id)).agg(sum_udf(df.v))
    +        expected5 = df.groupby(plus_two(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        result6 = df.groupby(df.id, plus_one(df.id)).agg(sum_udf(df.v))
    +        expected6 = df.groupby(df.id, plus_one(df.id)).agg(sum(df.v).alias('sum_udf(v)'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +        self.assertPandasEqual(expected5.toPandas(), result5.toPandas())
    +        self.assertPandasEqual(expected6.toPandas(), result6.toPandas())
    +
    +    def test_complex_expression(self):
    +        from pyspark.sql.functions import col, sum
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.withColumn('v1', plus_one(df.v))
    +                   .withColumn('v2', df.v + 2)
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')),
    +                        sum_udf(col('v1') + 3),
    +                        sum_udf(col('v2')) + 5,
    +                        plus_one(sum_udf(col('v1'))),
    +                        sum_udf(plus_one(col('v2'))))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.withColumn('v1', df.v + 1)
    +                     .withColumn('v2', df.v + 2)
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_udf(v)'),
    +                          sum(col('v1') + 3).alias('sum_udf((v1 + 3))'),
    +                          (sum(col('v2')) + 5).alias('(sum_udf(v2) + 5)'),
    +                          plus_one(sum(col('v1'))).alias('plus_one(sum_udf(v1))'),
    +                          sum(plus_one(col('v2'))).alias('sum_udf(plus_one(v2))'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.withColumn('v1', plus_one(df.v))
    +                   .withColumn('v2', df.v + 2)
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')),
    +                        sum_udf(col('v1') + 3),
    +                        sum_udf(col('v2')) + 5,
    +                        plus_two(sum_udf(col('v1'))),
    +                        sum_udf(plus_two(col('v2'))))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected2 = (df.withColumn('v1', df.v + 1)
    +                     .withColumn('v2', df.v + 2)
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_udf(v)'),
    +                          sum(col('v1') + 3).alias('sum_udf((v1 + 3))'),
    +                          (sum(col('v2')) + 5).alias('(sum_udf(v2) + 5)'),
    +                          plus_two(sum(col('v1'))).alias('plus_two(sum_udf(v1))'),
    +                          sum(plus_two(col('v2'))).alias('sum_udf(plus_two(v2))'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(df.v).alias('v'))
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')).alias('sum_v'))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v).alias('v'))
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_v'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +        self.assertPandasEqual(expected3, result3)
    +
    +    def test_retain_group_columns(self):
    +        from pyspark.sql.functions import sum, lit, col
    +        orig_value = self.spark.conf.get("spark.sql.retainGroupColumns", None)
    +        self.spark.conf.set("spark.sql.retainGroupColumns", False)
    +        try:
    +            df = self.data
    +            sum_udf = self.sum_udf
    +
    +            result1 = df.groupby(df.id).agg(sum_udf(df.v))
    +            expected1 = df.groupby(df.id).agg(sum(df.v).alias('sum_udf(v)'))
    +            self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        finally:
    +            if orig_value is None:
    +                self.spark.conf.unset("spark.sql.retainGroupColumns")
    +            else:
    +                self.spark.conf.set("spark.sql.retainGroupColumns", orig_value)
    +
    +    def test_invalid_args(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        mean_udf = self.mean_udf
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(
    +                    AnalysisException,
    +                    'nor.*aggregate function'):
    +                df.groupby(df.id).agg(plus_one(df.v)).collect()
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(
    +                    AnalysisException,
    +                    'aggregate function.*argument.*aggregate function'):
    +                df.groupby(df.id).agg(mean_udf(mean_udf(df.v))).collect()
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(
    +                    Exception,
    --- End diff --
    
    Shall we catch a narrower exception?
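    
    A sketch of the kind of narrowing meant here (a guess, not the final change): `Py4JJavaError` is only an assumed candidate type, and the aggregation below is a hypothetical stand-in for the case elided in the diff.
    
    ```python
    from py4j.protocol import Py4JJavaError  # assumed narrower type

    with QuietTest(self.sc):
        with self.assertRaisesRegexp(Py4JJavaError, 'aggregate function'):
            # hypothetical invalid usage; the real case is truncated above
            df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect()
    ```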


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85928/testReport)** for PR 19872 at commit [`cb36227`](https://github.com/apache/spark/commit/cb362274711c1b26ed19e87aa15bc8c64668eae6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165261550
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_manual(self):
    +        df = self.data
    +        sum_udf = self.pandas_agg_sum_udf
    +        mean_udf = self.pandas_agg_mean_udf
    +
    +        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        expected1 = self.spark.createDataFrame(
    +            [[0, 245.0, 24.5],
    +             [1, 255.0, 25.5],
    +             [2, 265.0, 26.5],
    +             [3, 275.0, 27.5],
    +             [4, 285.0, 28.5],
    +             [5, 295.0, 29.5],
    +             [6, 305.0, 30.5],
    +             [7, 315.0, 31.5],
    +             [8, 325.0, 32.5],
    +             [9, 335.0, 33.5]],
    +            ['id', 'sum(v)', 'avg(v)'])
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF with literal
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        # Groupby one column and aggregate one UDF without literal
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF without literal
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_unsupported_types(self):
    +        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegex(NotImplementedError, 'not supported'):
    --- End diff --
    
    I'll file the follow-up PR to fix it soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85530/testReport)** for PR 19872 at commit [`eeed9be`](https://github.com/apache/spark/commit/eeed9be030402ef73c56a2efc0f72e0d67945165).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r156038036
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,10 +15,9 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
    --- End diff --
    
    Do we need to move the package to catalyst?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154809806
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -89,8 +89,15 @@ def agg(self, *exprs):
             else:
                 # Columns
                 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    -            jdf = self._jgd.agg(exprs[0]._jc,
    -                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    +            if isinstance(exprs[0], UDFColumn):
    +                assert all(isinstance(c, UDFColumn) for c in exprs)
    --- End diff --
    
    I am still trying to figure out the best way to dispatch this, but either way I think we won't be able to mix Java UDAFs with pandas UDFs.
    
    @holdenk I am not sure what kind of warning message you have in mind. Can you please explain?
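    
    For concreteness, a rough sketch of the kind of dispatch being discussed, building on the `UDFColumn` check quoted above. This is illustrative only: the `UDFColumn` approach was later dropped, and the `aggInPandas` call shape here is an assumption modeled on the helper shown elsewhere in this thread, not merged code.
    
    ```python
    # Hypothetical branch inside GroupedData.agg(); everything not quoted
    # in the diff above is an assumption.
    if isinstance(exprs[0], UDFColumn):
        # All-or-nothing: a pandas aggregate UDF cannot be evaluated in the
        # same agg() call as a Java UDAF or a built-in aggregate.
        assert all(isinstance(c, UDFColumn) for c in exprs)
        jdf = self._jgd.aggInPandas(
            _to_seq(self.sql_ctx._sc, [c._jc for c in exprs]))
    else:
        jdf = self._jgd.agg(exprs[0]._jc,
                            _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    ```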


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    I ended up removing the `UDFColumn` class and using the existing `Aggregate` logical plan for the pandas group_agg UDF. This reuses a lot of the existing `Aggregate` code and minimizes the changes needed for the pandas group_agg UDF.
    
    The code works and three tests (test_basic, test_alias, test_multiple) pass now, but the code is kind of messy. I am going on vacation next week, but I will clean up the code and move this PR forward when I get back (Dec 16).
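    
    For reference, a minimal end-to-end sketch of the resulting API, using only names already shown in the tests in this thread (it assumes a DataFrame `df` with columns `id` and `v`):
    
    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    
    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def mean_udf(v):
        # v is a pandas.Series holding one group's values; return a scalar
        return v.mean()
    
    df.groupby('id').agg(mean_udf(df.v)).show()
    ```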
    
    Thanks all.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165268323
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_manual(self):
    +        df = self.data
    +        sum_udf = self.pandas_agg_sum_udf
    +        mean_udf = self.pandas_agg_mean_udf
    +
    +        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        expected1 = self.spark.createDataFrame(
    +            [[0, 245.0, 24.5],
    +             [1, 255.0, 25.5],
    +             [2, 265.0, 26.5],
    +             [3, 275.0, 27.5],
    +             [4, 285.0, 28.5],
    +             [5, 295.0, 29.5],
    +             [6, 305.0, 30.5],
    +             [7, 315.0, 31.5],
    +             [8, 325.0, 32.5],
    +             [9, 335.0, 33.5]],
    +            ['id', 'sum(v)', 'avg(v)'])
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF with literal
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        # Groupby one column and aggregate one UDF without literal
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF without literal
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_unsupported_types(self):
    +        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegex(NotImplementedError, 'not supported'):
    --- End diff --
    
    @yhuai, if you meant not running tests in Python 2, this link might be helpful. Let me leave it just in case - https://github.com/apache/spark/pull/19884#issuecomment-352730177.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162980691
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2221,6 +2223,35 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar
    +       The `returnType` should be a primitive data type, e.g, :class:`DoubleType`.
    --- End diff --
    
    Fixed. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86345 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86345/testReport)** for PR 19872 at commit [`17fad5c`](https://github.com/apache/spark/commit/17fad5c0f83edb142471e2c4a1ffad08d7a29c5d).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157825297
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,143 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, NamedExpression, PythonUDF, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.{DataType, StructField, StructType}
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingExpressions: Seq[Expression],
    +    udfExpressions: Seq[PythonUDF],
    +    resultExpressions: Seq[NamedExpression],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  override def output: Seq[Attribute] = resultExpressions.map(_.toAttribute)
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingExpressions.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingExpressions) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingExpressions.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfExpressions.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +    val dataTypes = new ArrayBuffer[DataType]
    +
    +    allInputs.appendAll(groupingExpressions)
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +          allInputs += e
    +          dataTypes += e.dataType
    +          allInputs.length - 1 - groupingExpressions.length
    +      }.toArray
    +    }.toArray
    +
    +    val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
    +      StructField(s"_$i", dt)
    +    })
    +
    +    inputRDD.mapPartitionsInternal { iter =>
    +      val grouped = if (groupingExpressions.isEmpty) {
    +        Iterator((null, iter))
    +      } else {
    +        val groupedIter = GroupedIterator(iter, groupingExpressions, child.output)
    +
    +        val dropGrouping =
    +          UnsafeProjection.create(allInputs.drop(groupingExpressions.length), child.output)
    +
    +        groupedIter.map {
    +          case (k, groupedRowIter) => (k, groupedRowIter.map(dropGrouping))
    +        }
    +      }
    +
    +      val context = TaskContext.get()
    +
    +      // The queue used to buffer input rows so we can drain it to
    +      // combine input with output from Python.
    +      val queue = HybridRowQueue(context.taskMemoryManager(),
    +        new File(Utils.getLocalDir(SparkEnv.get.conf)), groupingExpressions.length)
    +      context.addTaskCompletionListener { _ =>
    +        queue.close()
    +      }
    +
    +      // Add rows to queue to join later with the result.
    +      val projectedRowIter = grouped.map { case (groupingKey, rows) =>
    +        queue.add(groupingKey.asInstanceOf[UnsafeRow])
    +        rows
    +      }
    +
    +      val columnarBatchIter = new ArrowPythonRunner(
    +        pyFuncs, bufferSize, reuseWorker,
    +        PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF, argOffsets, schema,
    +        sessionLocalTimeZone, pandasRespectSessionTimeZone)
    +        .compute(projectedRowIter, context.partitionId(), context)
    +
    +      val joined = new JoinedRow
    +      val resultProj = UnsafeProjection.create(output, output)
    +
    +      columnarBatchIter.map(_.rowIterator.next()).map{ outputRow =>
    --- End diff --
    
    ```
    columnarBatchIter.flatMap(_.rowIterator)
    ```
    That doesn't work because `rowIterator` is a Java iterator, not a Scala iterator. We could convert it (e.g. with `.asScala`), but I am not sure that is better. @ueshin, if you prefer the flatMap version I can change it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84631/testReport)** for PR 19872 at commit [`4332f28`](https://github.com/apache/spark/commit/4332f28bc32ea07c6ba5e55b4d66d70498d29abd).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161500785
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    --- End diff --
    
    `test_mixed_udfs` .. ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154569899
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -89,8 +89,15 @@ def agg(self, *exprs):
             else:
                 # Columns
                 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    -            jdf = self._jgd.agg(exprs[0]._jc,
    -                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    +            if isinstance(exprs[0], UDFColumn):
    +                assert all(isinstance(c, UDFColumn) for c in exprs)
    --- End diff --
    
    An informative error message would be better.
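    
    For instance, something along these lines instead of the bare assert (the message wording is just a suggestion, not merged code):
    
    ```python
    if not all(isinstance(c, UDFColumn) for c in exprs):
        raise ValueError("Cannot mix pandas aggregate UDFs with other "
                         "aggregate expressions in groupby().agg()")
    ```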


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85605/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86350/testReport)** for PR 19872 at commit [`4d22107`](https://github.com/apache/spark/commit/4d22107cabb9683d9d1dcd8c03a4a6e45f34a909).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PySpark] User-defined aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85947/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161856360
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala ---
    @@ -15,12 +15,31 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.python
    +package org.apache.spark.sql.catalyst.expressions
     
    -import org.apache.spark.api.python.PythonFunction
    -import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable, UserDefinedExpression}
    +import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
    +import org.apache.spark.sql.catalyst.util.toPrettySQL
     import org.apache.spark.sql.types.DataType
     
    +/**
    + * Helper functions for PythonUDF
    --- End diff --
    
    I changed it to `[[PythonUDF]]`. I think Scaladoc should use `[[]]`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162429623
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4273,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    --- End diff --
    
    I added a `test_manual` to compute the results manually and compare.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157891450
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
               df.logicalPlan))
       }
     
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    --- End diff --
    
    Removed


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162635572
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4273,425 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    --- End diff --
    
    Ah, no worries. Thanks for the clarification.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644340
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
               df.logicalPlan))
       }
     
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    })
    +
    +    val project = Project(groupingNamedExpressions ++ childrenExpressions, child)
    +
    +    val udfOutputs = exprs.flatMap(expr =>
    +      Seq(AttributeReference(expr.name, expr.dataType)())
    +    )
    --- End diff --
    
    I think this could be inlined.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85442/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #85152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85152/testReport)** for PR 19872 at commit [`ea5d6f3`](https://github.com/apache/spark/commit/ea5d6f319aa3b1bba20ad86a51e6efb65658e3d2).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86188/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644620
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4016,6 +4016,89 @@ def test_unsupported_types(self):
                 with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
                     df.groupby('id').apply(f).collect()
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +    def assertFramesEqual(self, expected, result):
    +        msg = ("DataFrames are not equal: " +
    +               ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
    --- End diff --
    
    indentation nit
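    
    Assuming this refers to the continuation lines of `msg`, the PEP 8 aligned form would be roughly as below (the trailing `Result` part is an assumption, since the diff is truncated above):
    
    ```python
    msg = ("DataFrames are not equal: " +
           ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
           ("\n\nResult:\n%s\n%s" % (result, result.dtypes)))
    ```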


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86187/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158912955
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4052,6 +4066,323 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, float)
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        self.spark.conf.set("spark.sql.codegen.wholeStage", False)
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean_udf(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean_udf(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_array(self):
    +        from pyspark.sql.types import ArrayType, DoubleType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return [v.mean(), v.std()]
    +
    +    def test_struct(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    +                @pandas_udf('mean double, std double', PandasUDFType.GROUP_AGG)
    +                def mean_and_std_udf(v):
    +                    return (v.mean(), v.std())
    +
    +    def test_alias(self):
    +        from pyspark.sql.functions import mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +
    +        result1 = df.groupby('id').agg(mean_udf(df.v).alias('mean_alias'))
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('mean_alias'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_mixed_sql(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(sum_udf(df.v) + 1)
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg((sum(df.v) + 1).alias('(sum_udf(v) + 1)'))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                     .agg(sum_udf(df.v + 1))
    +                     .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                       .agg(sum(df.v + 1).alias('sum_udf((v + 1))'))
    +                       .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +    def test_mixed_udf(self):
    +        from pyspark.sql.functions import sum, mean
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.groupby('id')
    +                   .agg(plus_one(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected1 = (df.groupby('id')
    +                     .agg(plus_one(sum(df.v)).alias("plus_one(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        result2 = (df.groupby('id')
    +                   .agg(sum_udf(plus_one(df.v)))
    +                   .sort('id'))
    +
    +        expected2 = (df.groupby('id')
    +                     .agg(sum(df.v + 1).alias("sum_udf(plus_one(v))"))
    +                     .sort('id'))
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(plus_two(df.v)))
    +                   .sort('id'))
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v + 2).alias("sum_udf(plus_two(v))"))
    +                     .sort('id'))
    +
    +        result4 = (df.groupby('id')
    +                   .agg(plus_two(sum_udf(df.v)))
    +                   .sort('id'))
    +
    +        expected4 = (df.groupby('id')
    +                     .agg(plus_two(sum(df.v)).alias("plus_two(sum_udf(v))"))
    +                     .sort('id'))
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_multiple(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        mean_udf = self.mean_udf
    +        sum_udf = self.sum_udf
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = (df.groupBy('id')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.v),
    +                        weighted_mean_udf(df.v, df.w))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.groupBy('id')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.v).alias('sum_udf(v)'),
    +                          mean(df.v).alias('weighted_mean_udf(v, w)'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.groupBy('id', 'v')
    +                   .agg(mean_udf(df.v),
    +                        sum_udf(df.id))
    +                   .sort('id', 'v')
    +                   .toPandas())
    +
    +        expected2 = (df.groupBy('id', 'v')
    +                     .agg(mean(df.v).alias('mean_udf(v)'),
    +                          sum(df.id).alias('sum_udf(id)'))
    +                     .sort('id', 'v')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +
    +    def test_complex(self):
    +        from pyspark.sql.functions import col, sum
    +
    +        df = self.data
    +        plus_one = self.plus_one
    +        plus_two = self.plus_two
    +        sum_udf = self.sum_udf
    +
    +        result1 = (df.withColumn('v1', plus_one(df.v))
    +                   .withColumn('v2', df.v + 2)
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')),
    +                        sum_udf(col('v1') + 3),
    +                        sum_udf(col('v2')) + 5,
    +                        plus_one(sum_udf(col('v1'))),
    +                        sum_udf(plus_one(col('v2'))))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected1 = (df.withColumn('v1', df.v + 1)
    +                     .withColumn('v2', df.v + 2)
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_udf(v)'),
    +                          sum(col('v1') + 3).alias('sum_udf((v1 + 3))'),
    +                          (sum(col('v2')) + 5).alias('(sum_udf(v2) + 5)'),
    +                          plus_one(sum(col('v1'))).alias('plus_one(sum_udf(v1))'),
    +                          sum(col('v2') + 1).alias('sum_udf(plus_one(v2))'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result2 = (df.withColumn('v1', plus_one(df.v))
    +                   .withColumn('v2', df.v + 2)
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')),
    +                        sum_udf(col('v1') + 3),
    +                        sum_udf(col('v2')) + 5,
    +                        plus_two(sum_udf(col('v1'))),
    +                        sum_udf(plus_two(col('v2'))))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected2 = (df.withColumn('v1', df.v + 1)
    +                     .withColumn('v2', df.v + 2)
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_udf(v)'),
    +                          sum(col('v1') + 3).alias('sum_udf((v1 + 3))'),
    +                          (sum(col('v2')) + 5).alias('(sum_udf(v2) + 5)'),
    +                          plus_two(sum(col('v1'))).alias('plus_two(sum_udf(v1))'),
    +                          sum(col('v2') + 2).alias('sum_udf(plus_two(v2))'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        result3 = (df.groupby('id')
    +                   .agg(sum_udf(df.v).alias('v'))
    +                   .groupby('id')
    +                   .agg(sum_udf(col('v')).alias('sum_v'))
    +                   .sort('id')
    +                   .toPandas())
    +
    +        expected3 = (df.groupby('id')
    +                     .agg(sum(df.v).alias('v'))
    +                     .groupby('id')
    +                     .agg(sum(col('v')).alias('sum_v'))
    +                     .sort('id')
    +                     .toPandas())
    +
    +        self.assertPandasEqual(expected1, result1)
    +        self.assertPandasEqual(expected2, result2)
    +        self.assertPandasEqual(expected3, result3)
    +
    --- End diff --
    
    Can you add tests for complex/empty groupby like `GroupbyApplyTests.test_complex_groupby` or `GroupbyApplyTests.test_empty_groupby`?
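    
    A hedged sketch of what those two tests might look like, reusing the fixtures above (names and exact expectations are illustrative, not the final tests):
    
    ```python
    def test_complex_groupby(self):
        from pyspark.sql.functions import sum

        df = self.data
        sum_udf = self.sum_udf

        # Group by a derived expression rather than a plain column
        result = (df.groupby(df.v % 2)
                  .agg(sum_udf(df.v))
                  .sort(df.v % 2))
        expected = (df.groupby(df.v % 2)
                    .agg(sum(df.v).alias('sum_udf(v)'))
                    .sort(df.v % 2))
        self.assertPandasEqual(expected.toPandas(), result.toPandas())

    def test_empty_groupby(self):
        from pyspark.sql.functions import sum

        df = self.data
        sum_udf = self.sum_udf

        # No grouping columns: the whole frame is aggregated as one group
        result = df.groupby().agg(sum_udf(df.v))
        expected = df.groupby().agg(sum(df.v).alias('sum_udf(v)'))
        self.assertPandasEqual(expected.toPandas(), result.toPandas())
    ```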


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162372467
  
    --- Diff: python/pyspark/worker.py ---
    @@ -110,6 +110,17 @@ def wrapped(*series):
         return wrapped
     
     
    +def wrap_pandas_group_agg_udf(f, return_type):
    +    arrow_return_type = to_arrow_type(return_type)
    +
    +    def wrapped(*series):
    +        import pandas as pd
    +        result = f(*series)
    +        return pd.Series(result)
    --- End diff --
    
    @HyukjinKwon is right. I am not sure it's worth it performance-wise to have another ser/de, because the overhead is proportional to the number of groups rather than the number of rows. And it seems pretty fast too.
    
    ```
    %%time
    # assumed imports, not in the original snippet; `_create_batch` is a
    # pyspark-internal helper (found in pyspark.serializers in this era)
    import io
    import pandas as pd
    import pyarrow as pa
    from pyspark.serializers import _create_batch
    
    stream = io.BytesIO()
    
    for i in range(0, 1000):
        batch = _create_batch(pd.Series(i), None)  # one record batch per group
        writer = pa.RecordBatchStreamWriter(stream, batch.schema)
        writer.write_batch(batch)
        writer.close()
    
    CPU times: user 266 ms, sys: 12.7 ms, total: 279 ms
    Wall time: 281 ms
    ```
    This is an overhead of <1 ms per group.
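    
    As a minimal standalone sketch of what this boxing amounts to (the `wrap_group_agg` helper name here is made up; only the scalar-to-`pandas.Series` step mirrors the diff):
    
    ```python
    import pandas as pd

    def wrap_group_agg(f):
        # The user's aggregate function returns a scalar; boxing it into a
        # pandas.Series lets it be written out as an Arrow record batch,
        # one batch per group, which is where the per-group overhead comes from
        def wrapped(*series):
            return pd.Series([f(*series)])
        return wrapped

    mean_of = wrap_group_agg(lambda s: s.mean())
    print(mean_of(pd.Series([1.0, 2.0, 3.0])))  # length-1 Series: 0    2.0
    ```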
     


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157891787
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,9 +48,26 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case _ @ PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF ) => true
    +      case Alias(expr, _) => isPandasGroupAggUdf(expr)
    +      case _ => false
    +  }
    +
    +  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
    +    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)
    +    actualAggExpr.exists(isPandasGroupAggUdf)
    +  }
    +
    +
       private def extract(agg: Aggregate): LogicalPlan = {
         val projList = new ArrayBuffer[NamedExpression]()
         val aggExpr = new ArrayBuffer[NamedExpression]()
    +
    +    if (hasPandasGroupAggUdf(agg)) {
    +      Aggregate(agg.groupingExpressions, agg.aggregateExpressions, agg.child)
    +    } else {
    +
    --- End diff --
    
    Fixed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161517448
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    --- End diff --
    
    Or .. `PandasUDF` prefix? 


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86487/
    Test FAILed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Build finished. Test FAILed.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84415/
    Test FAILed.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158952610
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
    @@ -171,6 +171,7 @@ trait CheckAnalysis extends PredicateHelper {
                             s"appear in the arguments of an aggregate function.")
                       }
                     }
    +              case _: PythonUDF => // OK
    --- End diff --
    
    Ah, I was gonna add the check but forgot, my bad. Let me add a proper check for PythonUDF.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r157948426
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -48,29 +48,46 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
         }.isDefined
       }
     
    +  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
    +      case PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF) => true
    +      case Alias(child, _) => isPandasGroupAggUdf(child)
    +      case _ => false
    +  }
    +
    +  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
    +    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)
    +    actualAggExpr.exists(isPandasGroupAggUdf)
    +  }
    +
    +
    --- End diff --
    
    nit: remove an extra line.


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    @ramacode2014 Hi, I'm not sure why you received notifications from this PR, but I guess you can unsubscribe via the "Unsubscribe" button in the right column of this page. Sorry for the inconvenience. Thanks!


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86196/
    Test PASSed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161854960
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def mean_udf(v):
    +            return v.mean()
    +        return mean_udf
    +
    +    @property
    +    def sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum_udf(v):
    +            return v.sum()
    +        return sum_udf
    +
    +    @property
    +    def weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean_udf(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean_udf
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.weighted_mean_udf
    +
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    --- End diff --
    
    Sorry! Yes I added comments to each test case.
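    
    For context, the `weighted_mean_udf` fixture can be exercised end to end roughly like this (a hedged sketch; assumes a running SparkSession named `spark`):
    
    ```python
    import numpy as np
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = spark.createDataFrame(
        [(1, 1.0, 1.0), (1, 2.0, 3.0), (2, 3.0, 1.0)], ("id", "v", "w"))

    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def weighted_mean(v, w):
        return np.average(v, weights=w)

    # id=1 -> (1.0*1.0 + 2.0*3.0) / (1.0 + 3.0) = 1.75; id=2 -> 3.0
    df.groupby('id').agg(weighted_mean(df.v, df.w)).show()
    ```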


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85138/
    Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161854824
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4279,6 +4272,386 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def plus_one(self):
    --- End diff --
    
    Added prefix.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86344/testReport)** for PR 19872 at commit [`a94b146`](https://github.com/apache/spark/commit/a94b14671faa39295ecb26a0c59c34154384c07c).


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158913244
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -215,3 +228,49 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
         }
       }
     }
    +
    +
    +/**
    + * Extract all the group aggregate Pandas UDFs in logical aggregation, evaluate the UDFs first
    + * and then the expressions that depend on the result of the UDFs.
    + */
    +object ExtractGroupAggPandasUDFFromAggregate extends Rule[LogicalPlan] {
    --- End diff --
    
    I guess this rule name doesn't represent what it's actually doing.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #86346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86346/testReport)** for PR 19872 at commit [`0fec5cf`](https://github.com/apache/spark/commit/0fec5cf86619f0a42647c1c53b4cb5b3d449ecd8).


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162402735
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     
            .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
     
    +    3. GROUP_AGG
    +
    +       A group aggregate UDF defines a transformation: one or more `pandas.Series` -> a scalar.
    +       The returnType should be a primitive data type, e.g., `DoubleType()`.
    +       The returned scalar can be either a Python primitive type, e.g., `int` or `float`,
    +       or a numpy data type, e.g., `numpy.int64` or `numpy.float64`.
    +
    +       StructType and ArrayType are currently not supported.
    +
    +       Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg`.
    +
    +       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +       >>> df = spark.createDataFrame(
    +       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +       ...     ("id", "v"))
    +       >>> @pandas_udf("double", PandasUDFType.GROUP_AGG)
    +       ... def mean_udf(v):
    +       ...     return v.mean()
    +       >>> df.groupby("id").agg(mean_udf(df['v'])).show()  # doctest: +SKIP
    +       +---+-----------+
    +       | id|mean_udf(v)|
    +       +---+-----------+
    +       |  1|        1.5|
    +       |  2|        6.0|
    +       +---+-----------+
    +
    +       .. note:: There is no partial aggregation with group aggregate UDFs, i.e.,
    +           a full shuffle is required.
    --- End diff --
    
    Added
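    
    To make the note concrete, here is a hedged sketch of how the missing partial aggregation shows up in the physical plan (plan shapes are illustrative and vary by Spark version; assumes the `df` and `mean_udf` from the example above):
    
    ```python
    from pyspark.sql.functions import mean

    # Built-in aggregate: partial (map-side) aggregation runs before the shuffle
    df.groupby("id").agg(mean(df["v"])).explain()
    # roughly: HashAggregate(partial) -> Exchange -> HashAggregate(final)

    # Group aggregate pandas UDF: all raw rows of each group are shuffled first
    df.groupby("id").agg(mean_udf(df["v"])).explain()
    # roughly: Exchange -> AggregateInPandas
    ```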


---



[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    **[Test build #84414 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84414/testReport)** for PR 19872 at commit [`4cfaf0e`](https://github.com/apache/spark/commit/4cfaf0e9723bcfbb74dfd1b9d1f5e30682bf072f).


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161503314
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -111,6 +111,10 @@ def returnType(self):
                     and not isinstance(self._returnType_placeholder, StructType):
                 raise ValueError("Invalid returnType: returnType must be a StructType for "
                                  "pandas_udf with function type GROUP_MAP")
    +        elif self.evalType == PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF \
    +                and isinstance(self._returnType_placeholder, (StructType, ArrayType)):
    --- End diff --
    
    Hm .. I think we don't support `MapType` either?
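    
    A hedged sketch, in the suite's idiom (inside a `unittest.TestCase`), of how the restriction surfaces; `ArrayType` is shown here, and `MapType` would presumably be rejected the same way once added to the check:
    
    ```python
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, DoubleType

    with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
        @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUP_AGG)
        def mean_and_std_udf(v):
            return [v.mean(), v.std()]
    ```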


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    LGTM.
    @HyukjinKwon Do you have any concerns about this?
    I'd also cc @cloud-fan for another look.


---



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r158902463
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -39,13 +39,16 @@ private[spark] object PythonEvalType {
     
       val SQL_PANDAS_SCALAR_UDF = 200
       val SQL_PANDAS_GROUP_MAP_UDF = 201
    +  val SQL_PANDAS_GROUP_AGG_UDF = 202
     
       def toString(pythonEvalType: Int): String = pythonEvalType match {
         case NON_UDF => "NON_UDF"
         case SQL_BATCHED_UDF => "SQL_BATCHED_UDF"
         case SQL_PANDAS_SCALAR_UDF => "SQL_PANDAS_SCALAR_UDF"
         case SQL_PANDAS_GROUP_MAP_UDF => "SQL_PANDAS_GROUP_MAP_UDF"
    +    case SQL_PANDAS_GROUP_AGG_UDF => "SQL_PANDAS_GROUP_AGG_UDF"
       }
    +
    --- End diff --
    
    nit: remove an extra line.
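    
    As background for these constants: the Python worker keeps a mirror of the same values, and the two sides must agree because the eval type is sent over the wire to the worker (the import location below is from memory for this era and may differ by version):
    
    ```python
    from pyspark.rdd import PythonEvalType

    assert PythonEvalType.SQL_PANDAS_SCALAR_UDF == 200
    assert PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF == 201
    assert PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF == 202
    ```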


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r165253818
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
                     df.groupby('id').apply(f).collect()
     
     
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
    +
    +    @property
    +    def data(self):
    +        from pyspark.sql.functions import array, explode, col, lit
    +        return self.spark.range(10).toDF('id') \
    +            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
    +            .withColumn("v", explode(col('vs'))) \
    +            .drop('vs') \
    +            .withColumn('w', lit(1.0))
    +
    +    @property
    +    def python_plus_one(self):
    +        from pyspark.sql.functions import udf
    +
    +        @udf('double')
    +        def plus_one(v):
    +            assert isinstance(v, (int, float))
    +            return v + 1
    +        return plus_one
    +
    +    @property
    +    def pandas_scalar_plus_two(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.SCALAR)
    +        def plus_two(v):
    +            assert isinstance(v, pd.Series)
    +            return v + 2
    +        return plus_two
    +
    +    @property
    +    def pandas_agg_mean_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def avg(v):
    +            return v.mean()
    +        return avg
    +
    +    @property
    +    def pandas_agg_sum_udf(self):
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def sum(v):
    +            return v.sum()
    +        return sum
    +
    +    @property
    +    def pandas_agg_weighted_mean_udf(self):
    +        import numpy as np
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        @pandas_udf('double', PandasUDFType.GROUP_AGG)
    +        def weighted_mean(v, w):
    +            return np.average(v, weights=w)
    +        return weighted_mean
    +
    +    def test_manual(self):
    +        df = self.data
    +        sum_udf = self.pandas_agg_sum_udf
    +        mean_udf = self.pandas_agg_mean_udf
    +
    +        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        expected1 = self.spark.createDataFrame(
    +            [[0, 245.0, 24.5],
    +             [1, 255.0, 25.5],
    +             [2, 265.0, 26.5],
    +             [3, 275.0, 27.5],
    +             [4, 285.0, 28.5],
    +             [5, 295.0, 29.5],
    +             [6, 305.0, 30.5],
    +             [7, 315.0, 31.5],
    +             [8, 325.0, 32.5],
    +             [9, 335.0, 33.5]],
    +            ['id', 'sum(v)', 'avg(v)'])
    +
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +    def test_basic(self):
    +        from pyspark.sql.functions import col, lit, sum, mean
    +
    +        df = self.data
    +        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
    +
    +        # Groupby one column and aggregate one UDF with literal
    +        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
    +        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
    +        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF with literal
    +        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
    +            .sort(df.id + 1)
    +        expected2 = df.groupby((col('id') + 1))\
    +            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
    +        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
    +
    +        # Groupby one column and aggregate one UDF without literal
    +        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
    +        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
    +        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
    +
    +        # Groupby one expression and aggregate one UDF without literal
    +        result4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(weighted_mean_udf(df.v, df.w))\
    +            .sort('id')
    +        expected4 = df.groupby((col('id') + 1).alias('id'))\
    +            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
    +            .sort('id')
    +        self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
    +
    +    def test_unsupported_types(self):
    +        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegex(NotImplementedError, 'not supported'):
    --- End diff --
    
    @icexelloss This line fails under Python 2 (we need `assertRaisesRegexp`). Can you file a PR to fix it? Thanks! Meanwhile, we will look into the Jenkins setup and see why the test was not exercised.
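    
    For reference, a small self-contained sketch of the naming issue: Python 2's unittest only has the `assertRaisesRegexp` spelling, while Python 3 added `assertRaisesRegex` and kept the old name as a deprecated alias, so the old spelling runs on both:
    
    ```python
    import unittest

    class RegexAssertionTest(unittest.TestCase):
        def test_regex(self):
            # assertRaisesRegexp works on Python 2.7 and, as an alias, on 3.x
            with self.assertRaisesRegexp(ValueError, 'boom'):
                raise ValueError('boom')

    if __name__ == '__main__':
        unittest.main()
    ```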


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161512986
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -334,34 +339,51 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       object Aggregation extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
           case PhysicalAggregation(
    --- End diff --
    
    Can we have another `case` with `case ... if ...`? I think that's going to reduce the diff and make it easier to manage, review, etc.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86346/
    Test PASSed.


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r162097693
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
    @@ -27,7 +27,6 @@ import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Proj
     import org.apache.spark.sql.catalyst.rules.Rule
     import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}
     
    -
    --- End diff --
    
    Reverted


---



[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r161495144
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -82,6 +91,13 @@ def agg(self, *exprs):
             >>> from pyspark.sql import functions as F
             >>> sorted(gdf.agg(F.min(df.age)).collect())
             [Row(name=u'Alice', min(age)=2), Row(name=u'Bob', min(age)=5)]
    +
    +        >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    +        >>> @pandas_udf('int', PandasUDFType.GROUP_AGG)
    +        ... def min_udf(v):
    +        ...     return v.min()
    +        >>> sorted(gdf.agg(min_udf(df.age)).collect())  # doctest: +SKIP
    --- End diff --
    
    That's fine.


---



[GitHub] spark issue #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregation func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19872
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86344/
    Test FAILed.


---
