You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/06 22:07:42 UTC

[GitHub] [spark] rdblue opened a new pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

rdblue opened a new pull request #24559:
URL: https://github.com/apache/spark/pull/24559


   ## What changes were proposed in this pull request?
   
   This adds a new API for catalog plugins that exposes functions to Spark. The API can list and load functions. This does not include create, delete, or alter operations.
   
   There are 3 types of functions defined:
   * A `ScalarFunction` that produces a value for every call
   * An `AggregateFunction` that produces a value after updates for a group of rows
   * An `AssociativeAggregateFunction` that is an aggregate that can be run in parallel and can merge intermediate results
   
   Functions loaded from the catalog by name as `UnboundFunction`. Once input arguments are determined `bind` is called on the unbound function to get a `BoundFunction` implementation that is one of the 3 types above. Binding can fail if the function doesn't support the input type. `BoundFunction` returns the result type produced by the function.
   
   ## How was this patch tested?
   
   This includes a test that demonstrates the new API.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808625841


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136576/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814432311






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774369530


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134948/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587495179



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       Another important logic is how to serialize the buffer. As I mentioned before, Spark requires aggregate functions to implement partial merge, so serializing the buffer and shuffle it always happens.
   
   It's inefficient to simply use a java serializer. The aggregate buffer is also an `InternalRow`. Assuming you are implementing the `avg` function,  and the buffer includes 2 values: `sum` and `count`. It's inefficient to serialize `Tuple2(sum, count)` to a binary and save it in the buffer row. We should just write `sum` and `count` as 2 columns to the buffer row.
   
   I'd propose something like
   ```
   public interface AggregateFunction<R> extends BoundFunction {
     AggregationState newAggregationState();
     ...
   }
   
   public interface AggregationState {
     StructType schema();
     InternalRow serialize();
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731460261


   @wmoustafa, it sounds like we're in agreement that Spark should have a FunctionCatalog API, and Transport would provide a richer way for people to write function libraries that plugs in through Transport.
   
   I don't know very much about Transport, so I'm afraid I would mostly be speculating about it. But I don't see a reason why it wouldn't work through this proposed API, especially if it already has a way to work with Spark types and internal rows.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731497814


   **[Test build #131457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131457/testReport)** for PR 24559 at commit [`7fc5ce8`](https://github.com/apache/spark/commit/7fc5ce87406e24030da8dff4eaaaf9854c7ff632).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r596497034



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       One issue I found with the `Serializable` approach is that currently in Spark the `SerializerInstance` as well as `ExpressionEncoder` all require `ClassTag`, which is not available from Java. This makes it hard to reuse the existing machinery in Spark for the serialization/deserialization work. Another issue, which is reflected by the CI failure, is that simple classes such as:
   ```scala
   class IntAverage extends AggregateFunction[(Int, Int), Int]
   ```
   ~~will not work out-of-box, as `(Int, Int)` doesn't implement `Serializable`~~. Edit: sorry NVM on this one - `TupleN` does implement `Serializable` and the test failure is due to something else.
   
   The `ClassTag` constraint for `SerializerInstance` was added in #700 for supporting Scala Pickling as one of the serializer implementation but seems the PR never ended in Spark, so not quite sure if it is still needed today. Thanks @viirya for having a offline discussion with me on this.
   
   Because of this, I'm wondering if it makes sense to replace the `Serializable` with something else, such as another method:
   ```java
   Encoder<S> encoder();
   ```
   This can be implemented pretty easily by Spark users with [`Encoders`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala). The approach is similar to the `udaf` API today. For Scala users, we can optionally provide another version of `AggregateFunction` in Scala with implicit, so users don't need to do this.
   
   Would like to hear your opinion on this @rdblue @cloud-fan 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] xkrogen commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

xkrogen commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r604261124



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/FunctionCatalog.java
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.sql.catalyst.analysis.NoSuchFunctionException;
+import org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException;
+import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;
+
+/**
+ * Catalog methods for working with Functions.
+ */
+public interface FunctionCatalog extends CatalogPlugin {
+
+  /**
+   * List the functions in a namespace from the catalog.

Review comment:
       Should we specify expected behavior if no identifiers are found (i.e. return null vs. empty array)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731426534


   **[Test build #131447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131447/testReport)** for PR 24559 at commit [`b394e8b`](https://github.com/apache/spark/commit/b394e8b2c0c0cf11e572a9b13c763cf71d3ecd11).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731496928


   Retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729994512


   **[Test build #131304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131304/testReport)** for PR 24559 at commit [`b3ba28d`](https://github.com/apache/spark/commit/b3ba28d95064f6e532680c88264e8b352c2e3e60).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r588786051



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       `PartialAggregateFunction` already have this. Alternatively I'm wondering whether we can do:
   
   ```scala
   ByteBuffer serialize(state: S);
   S deserialize(buf: ByteBuffer);
   ```
   following the `TypedImperativeAggregate` interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729992086


   **[Test build #131304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131304/testReport)** for PR 24559 at commit [`b3ba28d`](https://github.com/apache/spark/commit/b3ba28d95064f6e532680c88264e8b352c2e3e60).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r605229735



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/UnboundFunction.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * Represents a user-defined function that is not bound to input types.
+ */
+public interface UnboundFunction extends Function {
+
+  /**
+   * Bind this function to an input type.
+   * <p>
+   * If the input type is not supported, implementations must throw
+   * {@link UnsupportedOperationException}.
+   * <p>
+   * For example, a "length" function that only supports a single string argument should throw
+   * UnsupportedOperationException if the struct has more than one field or if that field is not a
+   * string, and it may optionally throw if the field is nullable.
+   *
+   * @param inputType a struct type for inputs that will be passed to the bound function
+   * @return a function that can process rows with the given input type
+   * @throws UnsupportedOperationException If the function cannot be applied to the input type
+   */
+  BoundFunction bind(StructType inputType);
+
+  /**
+   * Returns Function documentation.

Review comment:
       I think they're synonymous in this context.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r596497034



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       One issue I found with the `Serializable` approach is that currently in Spark the `SerializerInstance` as well as `ExpressionEncoder` all require `ClassTag`, which is not available from Java. This makes it hard to reuse the existing machinery in Spark for the serialization/deserialization work. Another issue, which is reflected by the CI failure, is that simple classes such as:
   ```scala
   class IntAverage extends AggregateFunction[(Int, Int), Int]
   ```
   ~~will not work out-of-box, as `(Int, Int)` doesn't implement `Serializable`~~. 
   
   Edit: sorry `TupleN` does implement `Serializable` in Scala, and the issue is (it seems) we can't get a `AggregateFunction` from a `BoundFunction` with the `Serializable` constraint.
   
   The `ClassTag` constraint for `SerializerInstance` was added in #700 for supporting Scala Pickling as one of the serializer implementation but seems the PR never ended in Spark, so not quite sure if it is still needed today. Thanks @viirya for having a offline discussion with me on this.
   
   Because of this, I'm wondering if it makes sense to replace the `Serializable` with something else, such as another method:
   ```java
   Encoder<S> encoder();
   ```
   This can be implemented pretty easily by Spark users with [`Encoders`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala). The approach is similar to the `udaf` API today. For Scala users, we can optionally provide another version of `AggregateFunction` in Scala with implicit, so users don't need to do this.
   
   Would like to hear your opinion on this @rdblue @cloud-fan 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813676379


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41496/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774415544


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134949/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wmoustafa commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

wmoustafa commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731427427


   > I don't think that we would want to build support for a generic framework into Spark itself. I think Spark's API should be specific to Spark, just like the data source APIs are specific to Spark. That avoids complications like converting to Row or another representation for Hive.
   
   I think there are 2 types of APIs: Function Catalog APIs and UDF expression APIs (e.g., Generic UDFs). I mentioned the Transport API as a way to do the latter, and wanted to get your thoughts on the friendly-ship of the Function Catalog APIs to UDF expression APIs like Transport. To the user, Transport provides tools to make type validation and inference user-friendly (declarative using type signatures), and Java types that map to their SQL counterparts. To Spark, it is just an Expression API processing InternalRows.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814407970


   **[Test build #136966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136966/testReport)** for PR 24559 at commit [`bb8f2aa`](https://github.com/apache/spark/commit/bb8f2aa2181ff3270b91d6f15f05e4197f13df32).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814407970


   **[Test build #136966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136966/testReport)** for PR 24559 at commit [`bb8f2aa`](https://github.com/apache/spark/commit/bb8f2aa2181ff3270b91d6f15f05e4197f13df32).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811505637


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41366/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730010810


   **[Test build #131306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131306/testReport)** for PR 24559 at commit [`1721075`](https://github.com/apache/spark/commit/17210758bb7db883b56b4afcce2704e785620771).
    * This patch **fails to build**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wmoustafa edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

wmoustafa edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732539897


   In case you are interested in taking a look into what a UDF in Transport boils down to in the case of Spark, you can check [com/linkedin/transport/spark/StdUdfWrapper.scala](https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala). [This UDF](https://github.com/linkedin/transport/blob/wmoustafa-api-v1/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples/MapFromTwoArraysFunction.java) is an example user-facing UDF that is translated to the Expression API. You can think of it as an API on top of an API.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811513472


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41366/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #24559:
URL: https://github.com/apache/spark/pull/24559


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704585446


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34080/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587667679



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.
+   * <p>
+   * For example, a "plus" function may return false when it is bound to arguments that are always
+   * non-null, but true when either argument may be null.
+   *
+   * @return true if values produced by this function may be null, false otherwise
+   */
+  default boolean isResultNullable() {
+    return true;
+  }
+
+  /**
+   * Returns whether this function result is deterministic.
+   * <p>
+   * By default, functions are assumed to be deterministic. Functions that are not deterministic
+   * should override this method so that Spark can ensure the function runs only once for a given
+   * input.
+   *
+   * @return true if this function is deterministic, false otherwise
+   */
+  default boolean isDeterministic() {
+    return true;
+  }
+
+  /**
+   * Returns the canonical name of this function, used to determine if functions are equivalent.
+   * <p>
+   * The canonical name is used to determine whether two functions are the same when loaded by
+   * different catalogs. For example, the same catalog implementation may be used for by two
+   * environments, "prod" and "test". Functions produced by the catalogs may be equivalent, but
+   * loaded using different names, like "test.func_name" and "prod.func_name".
+   * <p>
+   * Names returned by this function should be unique and unlikely to conflict with similar
+   * functions in other catalogs. For example, many catalogs may define a "bucket" function with a
+   * different implementation. Adding context, like "com.mycompany.bucket(string)", is recommended
+   * to avoid unintentional collisions.
+   *
+   * @return a canonical name for this function
+   */
+  default String canonicalName() {
+    // by default, use a random UUID so a function is never equivalent to another, even itself.
+    // this method is not required so that generated implementations (or careless ones) are not
+    // added and forgotten. for example, returning "" as a place-holder could cause unnecessary
+    // bugs if not replaced before release.
+    return UUID.randomUUID().toString();
+  }

Review comment:
       I have no problem adding it now. It just adds complexity to what we need to think through and document. Do we support any cast or only safe casts? If a value can't be cast, do we pass null like ANSI behavior?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731460293


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36053/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729994542


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731428529


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131447/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813913663


   ```
   [info] org.apache.spark.sql.TPCDSQueryTestSuite *** ABORTED *** (1 second, 198 milliseconds)
   [info]   java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
   ```
   
   I saw this on other PRs as well. @maropu do you have any clue about it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813676379


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41496/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805365019


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587671933



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.
+   * <p>
+   * For example, a "plus" function may return false when it is bound to arguments that are always
+   * non-null, but true when either argument may be null.
+   *
+   * @return true if values produced by this function may be null, false otherwise
+   */
+  default boolean isResultNullable() {
+    return true;
+  }
+
+  /**
+   * Returns whether this function result is deterministic.
+   * <p>
+   * By default, functions are assumed to be deterministic. Functions that are not deterministic
+   * should override this method so that Spark can ensure the function runs only once for a given
+   * input.
+   *
+   * @return true if this function is deterministic, false otherwise
+   */
+  default boolean isDeterministic() {
+    return true;
+  }
+
+  /**
+   * Returns the canonical name of this function, used to determine if functions are equivalent.
+   * <p>
+   * The canonical name is used to determine whether two functions are the same when loaded by
+   * different catalogs. For example, the same catalog implementation may be used for by two
+   * environments, "prod" and "test". Functions produced by the catalogs may be equivalent, but
+   * loaded using different names, like "test.func_name" and "prod.func_name".
+   * <p>
+   * Names returned by this function should be unique and unlikely to conflict with similar
+   * functions in other catalogs. For example, many catalogs may define a "bucket" function with a
+   * different implementation. Adding context, like "com.mycompany.bucket(string)", is recommended
+   * to avoid unintentional collisions.
+   *
+   * @return a canonical name for this function
+   */
+  default String canonicalName() {
+    // by default, use a random UUID so a function is never equivalent to another, even itself.
+    // this method is not required so that generated implementations (or careless ones) are not
+    // added and forgotten. for example, returning "" as a place-holder could cause unnecessary
+    // bugs if not replaced before release.
+    return UUID.randomUUID().toString();
+  }

Review comment:
       I think we will add a new expression to call the new UDF, so we can do whatever we want.
   
   BTW we recently added the ANSI type coercion framework in https://github.com/apache/spark/pull/31349 , so it's very easy to implement safe/ANSI type coercion for the new UDF expression.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-781663402


   Hi, All.
   According to the dev mailing discussion, we reach a general agreement on catalog functionality itself (except the invocation part). In order to finalize the agreement and narrow down the future part, let's merge this PR with commenting out the following in `ScalarFunction.java`. If you think that's okay, could you update once more, @rdblue ?
   
   ```scala
   -  R produceResult(InternalRow input);
   +  // R produceResult(InternalRow input);
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731451923


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36051/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731506170


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] holdenk commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r500624589



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/AggregateFunction.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);

Review comment:
       Would it make sense to have a default implementation for taking an iterator of internalrows? Just thinking out loud.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/AggregateFunction.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.

Review comment:
       Just thinking back to our initial groupByKey impl, can we add a warning here that if folks do not implement the `AssociativeAggregateFunction` they are going to force all values for a key onto a single node.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731486753






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731451918






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-801336902


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40758/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730004650






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r588786051



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       `PartialAggregateFunction` already have this. Alternatively I'm wondering whether we can do:
   
   ```java
   ByteBuffer serialize(S state);
   S deserialize(ByteBuffer buf);
   ```
   following the `TypedImperativeAggregate` interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wmoustafa commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

wmoustafa commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730865485


   > I think we need a design doc for the UDF API. We need to think about ease-of-use and performance.
   
   @rdblue @cloud-fan What do you think of the [Transport](https://github.com/linkedin/transport) API? It is simple, wraps InternalRows in the case of Spark, and portable between Spark, Presto, Hive and Avro (and potentially other data formats, so UDFs can probably be pushed to the format layer)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731482932


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36060/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704581501


   **[Test build #129473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129473/testReport)** for PR 24559 at commit [`21a5f07`](https://github.com/apache/spark/commit/21a5f074e3b564a353da28901c8d6cb107ec04c2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704583801


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129473/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774369521


   **[Test build #134948 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134948/testReport)** for PR 24559 at commit [`8f1084e`](https://github.com/apache/spark/commit/8f1084eba9a122b25e7f0d8b6649a9e77eae30ba).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731502664


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36063/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731428524






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774371188


   **[Test build #134947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134947/testReport)** for PR 24559 at commit [`86c6ce7`](https://github.com/apache/spark/commit/86c6ce7483b701b43913e43948edf005cfe3dcab).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730010862






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731506166


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36063/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774369530


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134948/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731486607


   **[Test build #131454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131454/testReport)** for PR 24559 at commit [`7fc5ce8`](https://github.com/apache/spark/commit/7fc5ce87406e24030da8dff4eaaaf9854c7ff632).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808584279


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41160/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wmoustafa edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

wmoustafa edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732539897


   In case you are interested in taking a look into what a UDF in Transport boils down to in the case of Spark, you can check [com/linkedin/transport/spark/StdUdfWrapper.scala](https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala). This is an [example user-facing UDF](https://github.com/linkedin/transport/blob/wmoustafa-api-v1/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples/MapFromTwoArraysFunction.java) that is translated to the Expression API. You can think of it as an API on top of an API.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808625841


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136576/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800729668


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40714/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587672991



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can

Review comment:
       @dongjoon-hyun, @sunchao, @viirya, @dbtsai, @wmoustafa, what do you think?
   
   Sounds like we have 2 options:
   1. Combine the partial interface with the aggregate interface so everything is required to implement the partial methods
   2. Add a rule that fails all aggregates that don't support partial aggregation for now.
   
   The first option is the easiest, but it means that this API is less expressive. As @cloud-fan noted, partial aggregation doesn't make sense for all aggregate functions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811596875


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136783/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814526569


   **[Test build #136966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136966/testReport)** for PR 24559 at commit [`bb8f2aa`](https://github.com/apache/spark/commit/bb8f2aa2181ff3270b91d6f15f05e4197f13df32).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774415544


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134949/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809863561


   **[Test build #136672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136672/testReport)** for PR 24559 at commit [`249e7f9`](https://github.com/apache/spark/commit/249e7f98b907380b2111f1c51569c300334552b4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-775349355


   @cloud-fan, @dongjoon-hyun, I'll start a thread for discussion on the dev list. Let's move discussion there.
   
   Also, please read through the design doc. Your question about Spark adding casts is already addressed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813631723


   Retest this please.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809759390


   **[Test build #136672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136672/testReport)** for PR 24559 at commit [`249e7f9`](https://github.com/apache/spark/commit/249e7f98b907380b2111f1c51569c300334552b4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r604277368



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.sql.types.StructType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {

Review comment:
       nit: I think it's no necessary to extend `Serializable` since `Function` already extends that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] daniel-goldstein commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

daniel-goldstein commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-1030687911


   Not sure if this is the best place for this, but we've encountered the binding failure multiple times in our own containerized environments and found this issue to be in containers that ended up getting entirely numeric hostnames. `getaddrinfo` (which I assume Java's `getLocalHost` uses), may sometimes misinterpret fully numeric hostnames [as IP addresses](https://bugzilla.redhat.com/show_bug.cgi?id=1059122). I'm assuming based off [this code](https://github.com/apache/spark/blob/c7c51bcab5cb067d36bccf789e0e4ad7f37ffb7c/core/src/main/scala/org/apache/spark/util/Utils.scala#L1013) that the `SPARK_LOCAL_IP` workaround circumvents this issue. Here's an easy replication of this error (though a bit contrived):
   
   ```
   docker run --hostname 2886795934 -e SPARK_MODE=master bitnami/spark:3.2.1
   ```
   
   In our case, setting a numeric hostname was our fault, and docker [explicitly rejects](https://github.com/moby/moby/blob/12f1b3ce43fe4aea5a41750bcc20f2a7dd67dbfc/pkg/stringid/stringid.go#L47) numeric hostnames, it seems for the same reason. I'm not very familiar with GA and from a quick browsing am unsure if this could ever happen, but thought it might be good to keep in mind if this continues to be a sporadic failure and whether or not Spark should be aware of this failure mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729994542






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774401013


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-815104215


   Thank you, @rdblue and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809759390


   **[Test build #136672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136672/testReport)** for PR 24559 at commit [`249e7f9`](https://github.com/apache/spark/commit/249e7f98b907380b2111f1c51569c300334552b4).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wmoustafa commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

wmoustafa commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732539897


   In case you are interested in taking a look into what a UDF in Transport boils down to in the case of Spark, you can check this [class](https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731460028


   **[Test build #131454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131454/testReport)** for PR 24559 at commit [`7fc5ce8`](https://github.com/apache/spark/commit/7fc5ce87406e24030da8dff4eaaaf9854c7ff632).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811513472


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41366/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587484940



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can

Review comment:
       Spark requires all aggregate functions to support partial merge, because the distributed aggregate operator in Spark requires it. As a result, even aggregate functions like `collect_list` have implemented the partial merging logic, though this doesn't make sense to them.
   
   I don't think Spark can change this in the near future, and the current two UDAF APIs in Spark(`UserDefinedAggregateFunction` and `Aggregator`) also require implementing partial merge. I think we need to do the same here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809794803


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41254/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774380337


   **[Test build #134949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134949/testReport)** for PR 24559 at commit [`15e127f`](https://github.com/apache/spark/commit/15e127f67b50f1150a5ec2a5ebf7a89dc5aa4da4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587671933



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.
+   * <p>
+   * For example, a "plus" function may return false when it is bound to arguments that are always
+   * non-null, but true when either argument may be null.
+   *
+   * @return true if values produced by this function may be null, false otherwise
+   */
+  default boolean isResultNullable() {
+    return true;
+  }
+
+  /**
+   * Returns whether this function result is deterministic.
+   * <p>
+   * By default, functions are assumed to be deterministic. Functions that are not deterministic
+   * should override this method so that Spark can ensure the function runs only once for a given
+   * input.
+   *
+   * @return true if this function is deterministic, false otherwise
+   */
+  default boolean isDeterministic() {
+    return true;
+  }
+
+  /**
+   * Returns the canonical name of this function, used to determine if functions are equivalent.
+   * <p>
+   * The canonical name is used to determine whether two functions are the same when loaded by
+   * different catalogs. For example, the same catalog implementation may be used for by two
+   * environments, "prod" and "test". Functions produced by the catalogs may be equivalent, but
+   * loaded using different names, like "test.func_name" and "prod.func_name".
+   * <p>
+   * Names returned by this function should be unique and unlikely to conflict with similar
+   * functions in other catalogs. For example, many catalogs may define a "bucket" function with a
+   * different implementation. Adding context, like "com.mycompany.bucket(string)", is recommended
+   * to avoid unintentional collisions.
+   *
+   * @return a canonical name for this function
+   */
+  default String canonicalName() {
+    // by default, use a random UUID so a function is never equivalent to another, even itself.
+    // this method is not required so that generated implementations (or careless ones) are not
+    // added and forgotten. for example, returning "" as a place-holder could cause unnecessary
+    // bugs if not replaced before release.
+    return UUID.randomUUID().toString();
+  }

Review comment:
       I think we will add a new expression to call the new UDF, so we can do whatever we want.
   
   BTW we recently added the ANSI type coercion in https://github.com/apache/spark/pull/31349 , so it's very easy to implement safe/ANSI type coercion for the new UDF expression.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

maropu edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814507882


   > I saw this on other PRs as well. @maropu do you have any clue about it?
   >> Looks like there is something wrong with the GA's DNS.. We encounter the same >> binding issue in Kyuubi too - yaooqinn/kyuubi#489
   
   Yea, it looks like a GA env issue. I saw the error message sometimes in the other PRs.... I think the workaround for now is just to re-run a GA job (I will try to find a solution if possible). FYI: @dongjoon-hyun @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800721970


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40714/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r527963417



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {

Review comment:
       Could you add the missing import, @rdblue ?
   ```scala
   [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala:23:38: not found: type SparkFunSuite
   [error] class AggregateFunctionSuite extends SparkFunSuite {
   ```

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {

Review comment:
       Could you add the missing **import**, @rdblue ?
   ```scala
   [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala:23:38: not found: type SparkFunSuite
   [error] class AggregateFunctionSuite extends SparkFunSuite {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731506172


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36063/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-801336902


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40758/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731428510


   **[Test build #131447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131447/testReport)** for PR 24559 at commit [`b394e8b`](https://github.com/apache/spark/commit/b394e8b2c0c0cf11e572a9b13c763cf71d3ecd11).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704583787






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587678767



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       I agree that perf may not be a big issue here, but the serialization logic needs to be provided. How about
   ```
   public interface AggregateFunction<S extends Serializble, R>
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r596497034



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       One issue I found with the `Serializable` approach is that currently in Spark the `SerializerInstance` as well as `ExpressionEncoder` all require `ClassTag`, which is not available from Java. This makes it hard to reuse the existing machinery in Spark for the serialization/deserialization work. Another issue, which is reflected by the CI failure, is that simple classes such as:
   ```scala
   class IntAverage extends AggregateFunction[(Int, Int), Int]
   ```
   ~~will not work out-of-box, as `(Int, Int)` doesn't implement `Serializable`~~. 
   
   Edit: sorry `TupleN` does implement `Serializable` in Scala, and the issue is (it seems) we can't get a `AggregateFunction` from a `BoundFunction` with the `Serializable` constraint.
   
   The `ClassTag` constraint for `SerializerInstance` was added in #700 for supporting Scala Pickling as one of the serializer implementation but seems the PR never ended in Spark, so not quite sure if it is still needed today, although it would require change a public developer API. Thanks @viirya for having a offline discussion with me on this.
   
   Because of this, I'm wondering if it makes sense to replace the `Serializable` with something else, such as another method:
   ```java
   Encoder<S> encoder();
   ```
   This can be implemented pretty easily by Spark users with [`Encoders`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala). The approach is similar to the `udaf` API today. For Scala users, we can optionally provide another version of `AggregateFunction` in Scala with implicit, so users don't need to do this.
   
   Would like to hear your opinion on this @rdblue @cloud-fan 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r571504355



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.

Review comment:
       `Returns the whether` -> `Returns whether the`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731420356


   **[Test build #131445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131445/testReport)** for PR 24559 at commit [`aa39040`](https://github.com/apache/spark/commit/aa39040f0041ccddedd97f180484d989667f8c70).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730002474


   **[Test build #131305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131305/testReport)** for PR 24559 at commit [`2eec8e5`](https://github.com/apache/spark/commit/2eec8e5b587f780969f5cc6ace0789cacb20db7d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729994551


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131304/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r600015853



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       Agree that from users' point of view `Serializable` is more friendly. This also means (from what I can see), though, we cannot reuse the existing `Serializer` and `SerializerInstance` from Spark, which allows Spark users to choose between Java and Kryo or plug in their own implementations. 
   
   Since it seems that performance won't be a concern for this use case maybe we can just use plan Java serializer for the job.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       Agree that from users' point of view `Serializable` is more friendly. This also means (from what I can see), though, we cannot reuse the existing `Serializer` and `SerializerInstance` from Spark, which allows Spark users to choose between Java and Kryo or plug in their own implementations. 
   
   Since it seems that performance won't be a concern for this use case maybe we can just use plain Java serializer for the job.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731428524


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774370078


   **[Test build #134947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134947/testReport)** for PR 24559 at commit [`86c6ce7`](https://github.com/apache/spark/commit/86c6ce7483b701b43913e43948edf005cfe3dcab).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813649295


   **[Test build #136919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136919/testReport)** for PR 24559 at commit [`37439c6`](https://github.com/apache/spark/commit/37439c6069bc82c1944eb44dd2f61712d35a9e9b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-815420455


   🎉 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800710515


   **[Test build #136132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136132/testReport)** for PR 24559 at commit [`c0e9209`](https://github.com/apache/spark/commit/c0e92092b3cca62d18ef11ad32787a43c949b148).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r605229535



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Function.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import java.io.Serializable;
+
+/**
+ * Base class for user-defined functions.
+ */
+public interface Function extends Serializable {
+
+  /**
+   * A name to identify this function. Implementations should provide a meaningful name, like the
+   * database and function name from the catalog.
+   */
+  String name();

Review comment:
       Not here because this is the name reported by the function for Spark to use in messages. Case sensitivity is a concern when loading. There, we can't change whether the catalog is case sensitive or not, so I'm not sure it would provide much value. If you think it does, then please follow up with a PR that adds some wording for the `TableCatalog` and `FunctionCatalog` interfaces.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809864355


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136672/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774369327


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39531/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603628151



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {
+  test("Test simple iavg(int)") {
+    val rows = Seq(InternalRow(2), InternalRow(2), InternalRow(2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", IntegerType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == 2)
+  }
+
+  test("Test simple iavg(long)") {
+    val bigValue = 9762097370951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+
+  test("Test associative iavg(long)") {
+    val bigValue = 7620099737951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val state1 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val state2 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val finalState = udaf.merge(state1, state2)
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+}
+
+object IntegralAverage extends UnboundFunction {
+  override def name(): String = "iavg"
+
+  override def bind(inputType: StructType): BoundFunction = {
+    if (inputType.fields.length > 1) {
+      throw new UnsupportedOperationException("Too many arguments")
+    }
+
+    if (inputType.fields(0).nullable) {
+      throw new UnsupportedOperationException("Nullable values are not supported")
+    }
+
+    inputType.fields(0).dataType match {
+      case _: IntegerType => IntAverage
+      case _: LongType => LongAverage
+      case dataType =>
+        throw new UnsupportedOperationException(s"Unsupported non-integral type: $dataType")
+    }
+  }
+
+  override def description(): String =
+    """iavg: produces an average using integer division
+      |  iavg(int not null) -> int
+      |  iavg(bigint not null) -> bigint""".stripMargin
+}
+
+object IntAverage extends AggregateFunction[(Int, Int), Int] {
+
+  override def inputTypes(): Array[DataType] = Array(IntegerType)
+
+  override def name(): String = "iavg"
+
+  override def newAggregationState(): (Int, Int) = (0, 0)
+
+  override def update(state: (Int, Int), input: InternalRow): (Int, Int) = {
+    val i = input.getInt(0)
+    state match {
+      case (_, 0) =>
+        (i, 1)
+      case (total, count) =>
+        (total + i, count + 1)
+    }
+  }
+
+  override def merge(leftState: (Int, Int), rightState: (Int, Int)): (Int, Int) = {
+    (leftState._1 + rightState._1, leftState._2 + rightState._2)
+  }
+
+  override def produceResult(state: (Int, Int)): Int = state._1 / state._2

Review comment:
       This is already handled and will result in an `ArithmeticException`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731451908


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36051/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805300507


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136421/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731523709


   **[Test build #131457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131457/testReport)** for PR 24559 at commit [`7fc5ce8`](https://github.com/apache/spark/commit/7fc5ce87406e24030da8dff4eaaaf9854c7ff632).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774369327


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39531/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

maropu edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814507882






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800712542


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136132/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808619047


   **[Test build #136576 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136576/testReport)** for PR 24559 at commit [`70583d8`](https://github.com/apache/spark/commit/70583d8f652537f13a88c3a70417af734ea147d2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730008976


   **[Test build #131306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131306/testReport)** for PR 24559 at commit [`1721075`](https://github.com/apache/spark/commit/17210758bb7db883b56b4afcce2704e785620771).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814440720


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41543/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r588788233



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can

Review comment:
       I'd probably prefer the first option if there aren't many opinions out there about this. It is disappointing that Spark doesn't support simpler aggregates, though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587670239



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       I don't think requiring the UDF to return an InternalRow is a good idea. That's one of the problems of the existing API that we want to fix.
   
   Keep in mind that this serialization will only occur when Spark needs to shuffle the partial aggregate state, so the cost of Java serialization is amortized over all of the values that are processed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gatorsmile commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

gatorsmile commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732547912


   Defining a proper public interface for external functions is critical to the community. This requires an investigation and compare the new interface with the other popular database/processing systems. 
   
   Also CC the SQL experts cc @bart-samwel @srielau 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731450659


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36053/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731460304






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731451918


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731422947


   **[Test build #131445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131445/testReport)** for PR 24559 at commit [`aa39040`](https://github.com/apache/spark/commit/aa39040f0041ccddedd97f180484d989667f8c70).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731420356


   **[Test build #131445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131445/testReport)** for PR 24559 at commit [`aa39040`](https://github.com/apache/spark/commit/aa39040f0041ccddedd97f180484d989667f8c70).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r588786051



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       `PartialAggregateFunction` already have this. Alternatively I'm wondering whether we can do:
   
   ```scala
   def serialize(state: S): ByteBuffer
   def deserialize(buf: ByteBuffer): S
   ```
   following the `TypedImperativeAggregate` interface.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can

Review comment:
       > I don't think Spark can change this in the near future
   
   Have we (i.e., Spark community) thought about removing this restriction in future? If there's no motivation to do this then IMO option 1) is better since otherwise having the `AggregateFunction` is a little confusing as ppl will always use `PartialAggregateFunction`. Otherwise, I'd go with option 2).
   
   With option 2) we can just simply reject non `PartialAggregateFunction` s right after they are loaded from the catalog right?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809788355


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41254/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603623878



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {
+  test("Test simple iavg(int)") {
+    val rows = Seq(InternalRow(2), InternalRow(2), InternalRow(2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", IntegerType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == 2)
+  }
+
+  test("Test simple iavg(long)") {
+    val bigValue = 9762097370951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+
+  test("Test associative iavg(long)") {
+    val bigValue = 7620099737951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val state1 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val state2 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val finalState = udaf.merge(state1, state2)
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+}
+
+object IntegralAverage extends UnboundFunction {
+  override def name(): String = "iavg"
+
+  override def bind(inputType: StructType): BoundFunction = {
+    if (inputType.fields.length > 1) {
+      throw new UnsupportedOperationException("Too many arguments")
+    }
+
+    if (inputType.fields(0).nullable) {
+      throw new UnsupportedOperationException("Nullable values are not supported")
+    }
+
+    inputType.fields(0).dataType match {
+      case _: IntegerType => IntAverage
+      case _: LongType => LongAverage
+      case dataType =>
+        throw new UnsupportedOperationException(s"Unsupported non-integral type: $dataType")
+    }
+  }
+
+  override def description(): String =
+    """iavg: produces an average using integer division
+      |  iavg(int not null) -> int
+      |  iavg(bigint not null) -> bigint""".stripMargin
+}
+
+object IntAverage extends AggregateFunction[(Int, Int), Int] {
+
+  override def inputTypes(): Array[DataType] = Array(IntegerType)
+
+  override def name(): String = "iavg"
+
+  override def newAggregationState(): (Int, Int) = (0, 0)
+
+  override def update(state: (Int, Int), input: InternalRow): (Int, Int) = {
+    val i = input.getInt(0)
+    state match {
+      case (_, 0) =>
+        (i, 1)
+      case (total, count) =>
+        (total + i, count + 1)
+    }
+  }
+
+  override def merge(leftState: (Int, Int), rightState: (Int, Int)): (Int, Int) = {
+    (leftState._1 + rightState._1, leftState._2 + rightState._2)
+  }
+
+  override def produceResult(state: (Int, Int)): Int = state._1 / state._2

Review comment:
       For a simple test?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774397966


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813649295


   **[Test build #136919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136919/testReport)** for PR 24559 at commit [`37439c6`](https://github.com/apache/spark/commit/37439c6069bc82c1944eb44dd2f61712d35a9e9b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729991671






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704583764


   **[Test build #129473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129473/testReport)** for PR 24559 at commit [`21a5f07`](https://github.com/apache/spark/commit/21a5f074e3b564a353da28901c8d6cb107ec04c2).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809794803


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41254/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731497814


   **[Test build #131457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131457/testReport)** for PR 24559 at commit [`7fc5ce8`](https://github.com/apache/spark/commit/7fc5ce87406e24030da8dff4eaaaf9854c7ff632).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732532535


   @dongjoon-hyun, out of curiosity, where did they say they are negative on this PR? I see one comment from @cloud-fan:
   
   > I think we need a design doc for the UDF API. We need to think about ease-of-use and performance.
   
   which the PR author agreed after that:
   
   > @cloud-fan, I agree that we will eventually want a doc. This is intended to get everyone thinking about what it will look like and what the performance would be.
   
   Did I maybe miss some in another discussion thread?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730004650






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813669734






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813768990


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136919/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811487701


   **[Test build #136783 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136783/testReport)** for PR 24559 at commit [`37439c6`](https://github.com/apache/spark/commit/37439c6069bc82c1944eb44dd2f61712d35a9e9b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814529276


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136966/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731523938






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731460028


   **[Test build #131454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131454/testReport)** for PR 24559 at commit [`7fc5ce8`](https://github.com/apache/spark/commit/7fc5ce87406e24030da8dff4eaaaf9854c7ff632).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814507882


   > I saw this on other PRs as well. @maropu do you have any clue about it?
   >> Looks like there is something wrong with the GA's DNS.. We encounter the same >> binding issue in Kyuubi too - yaooqinn/kyuubi#489
   
   Yea, it looks like a GA env issue. I saw the error message sometimes in the other PRs.... I think the workaround for now is just to re-run a GA job. FYI: @dongjoon-hyun @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808561164


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41160/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811487701


   **[Test build #136783 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136783/testReport)** for PR 24559 at commit [`37439c6`](https://github.com/apache/spark/commit/37439c6069bc82c1944eb44dd2f61712d35a9e9b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603643390



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+import java.io.Serializable;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * For each input row, Spark will call an update method that corresponds to the
+ * {@link #inputTypes() input data types}. The expected JVM argument types must be the types used by
+ * Spark's InternalRow API. If no direct method is found or when not using codegen, Spark will call
+ * update with {@link InternalRow}.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * All implementations must support partial aggregation by implementing {@link #merge(S, S)} so
+ * that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all
+ * rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to
+ * produce the result.
+ * <p>
+ * Intermediate aggregation state must be {@link Serializable} so that state produced by parallel
+ * tasks can be sent to a single executor and merged to produce a final result.

Review comment:
       This is describing the behavior for a single group. That's why the final aggregation is done on a single executor. Each group is shuffled to one executor. While I think it is correct, I like your version better so I'll update it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-801374970


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136176/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-801374970


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136176/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730005486






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800729668


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40714/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

maropu edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814507882


   > I saw this on other PRs as well. @maropu do you have any clue about it?
   >> Looks like there is something wrong with the GA's DNS.. We encounter the same >> binding issue in Kyuubi too - yaooqinn/kyuubi#489
   
   Yea, it looks like a GA env issue. I saw the same error message in the other test suite, e.g., `RateStreamProviderSuite`) I think the workaround for now is just to re-run a GA job (but, I will try to find a solution). FYI: @dongjoon-hyun @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774401013


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730011152






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730005486


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603373356



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {
+  test("Test simple iavg(int)") {
+    val rows = Seq(InternalRow(2), InternalRow(2), InternalRow(2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", IntegerType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == 2)
+  }
+
+  test("Test simple iavg(long)") {
+    val bigValue = 9762097370951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+
+  test("Test associative iavg(long)") {
+    val bigValue = 7620099737951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val state1 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val state2 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val finalState = udaf.merge(state1, state2)
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+}
+
+object IntegralAverage extends UnboundFunction {
+  override def name(): String = "iavg"
+
+  override def bind(inputType: StructType): BoundFunction = {
+    if (inputType.fields.length > 1) {
+      throw new UnsupportedOperationException("Too many arguments")
+    }
+
+    if (inputType.fields(0).nullable) {
+      throw new UnsupportedOperationException("Nullable values are not supported")
+    }
+
+    inputType.fields(0).dataType match {
+      case _: IntegerType => IntAverage
+      case _: LongType => LongAverage
+      case dataType =>
+        throw new UnsupportedOperationException(s"Unsupported non-integral type: $dataType")
+    }
+  }
+
+  override def description(): String =
+    """iavg: produces an average using integer division
+      |  iavg(int not null) -> int
+      |  iavg(bigint not null) -> bigint""".stripMargin
+}
+
+object IntAverage extends AggregateFunction[(Int, Int), Int] {
+
+  override def inputTypes(): Array[DataType] = Array(IntegerType)
+
+  override def name(): String = "iavg"
+
+  override def newAggregationState(): (Int, Int) = (0, 0)
+
+  override def update(state: (Int, Int), input: InternalRow): (Int, Int) = {
+    val i = input.getInt(0)
+    state match {
+      case (_, 0) =>
+        (i, 1)
+      case (total, count) =>
+        (total + i, count + 1)
+    }
+  }
+
+  override def merge(leftState: (Int, Int), rightState: (Int, Int)): (Int, Int) = {
+    (leftState._1 + rightState._1, leftState._2 + rightState._2)
+  }
+
+  override def produceResult(state: (Int, Int)): Int = state._1 / state._2

Review comment:
       we should handle empty input, where `state._2` is 0.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+import java.io.Serializable;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * For each input row, Spark will call an update method that corresponds to the
+ * {@link #inputTypes() input data types}. The expected JVM argument types must be the types used by
+ * Spark's InternalRow API. If no direct method is found or when not using codegen, Spark will call
+ * update with {@link InternalRow}.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * All implementations must support partial aggregation by implementing {@link #merge(S, S)} so
+ * that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all
+ * rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to
+ * produce the result.
+ * <p>
+ * Intermediate aggregation state must be {@link Serializable} so that state produced by parallel
+ * tasks can be sent to a single executor and merged to produce a final result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state; must be {@link Serializable}
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S extends Serializable, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate

Review comment:
       How about `This method is called one time for every group of values per task to initialize ...`?

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {
+  test("Test simple iavg(int)") {
+    val rows = Seq(InternalRow(2), InternalRow(2), InternalRow(2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", IntegerType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == 2)
+  }
+
+  test("Test simple iavg(long)") {
+    val bigValue = 9762097370951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val finalState = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+
+  test("Test associative iavg(long)") {
+    val bigValue = 7620099737951020L
+    val rows = Seq(InternalRow(bigValue + 2), InternalRow(bigValue), InternalRow(bigValue - 2))
+
+    val bound = IntegralAverage.bind(new StructType().add("foo", LongType, nullable = false))
+    assert(bound.isInstanceOf[AggregateFunction[_, _]])
+    val udaf = bound.asInstanceOf[AggregateFunction[Serializable, _]]
+
+    val state1 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val state2 = rows.foldLeft(udaf.newAggregationState()) { (state, row) =>
+      udaf.update(state, row)
+    }
+    val finalState = udaf.merge(state1, state2)
+
+    assert(udaf.produceResult(finalState) == bigValue)
+  }
+}
+
+object IntegralAverage extends UnboundFunction {
+  override def name(): String = "iavg"
+
+  override def bind(inputType: StructType): BoundFunction = {
+    if (inputType.fields.length > 1) {
+      throw new UnsupportedOperationException("Too many arguments")
+    }
+
+    if (inputType.fields(0).nullable) {
+      throw new UnsupportedOperationException("Nullable values are not supported")
+    }
+
+    inputType.fields(0).dataType match {
+      case _: IntegerType => IntAverage
+      case _: LongType => LongAverage
+      case dataType =>
+        throw new UnsupportedOperationException(s"Unsupported non-integral type: $dataType")
+    }
+  }
+
+  override def description(): String =
+    """iavg: produces an average using integer division
+      |  iavg(int not null) -> int
+      |  iavg(bigint not null) -> bigint""".stripMargin
+}
+
+object IntAverage extends AggregateFunction[(Int, Int), Int] {
+
+  override def inputTypes(): Array[DataType] = Array(IntegerType)
+
+  override def name(): String = "iavg"
+
+  override def newAggregationState(): (Int, Int) = (0, 0)
+
+  override def update(state: (Int, Int), input: InternalRow): (Int, Int) = {
+    val i = input.getInt(0)
+    state match {
+      case (_, 0) =>

Review comment:
       This seems to be covered by `case (total, count) => (total + i, count + 1)` already.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+import java.io.Serializable;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * For each input row, Spark will call an update method that corresponds to the
+ * {@link #inputTypes() input data types}. The expected JVM argument types must be the types used by
+ * Spark's InternalRow API. If no direct method is found or when not using codegen, Spark will call
+ * update with {@link InternalRow}.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * All implementations must support partial aggregation by implementing {@link #merge(S, S)} so
+ * that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all
+ * rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to
+ * produce the result.
+ * <p>
+ * Intermediate aggregation state must be {@link Serializable} so that state produced by parallel
+ * tasks can be sent to a single executor and merged to produce a final result.

Review comment:
       Spark's aggregate operator is distributed (if there are grouping columns). So `can be serialized and shuffled` is more accurate than `sent to a single executor`, which only happens with global aggregate (no grouping columns).

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.

Review comment:
       +1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] yaooqinn commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

yaooqinn commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814253301


   > ```
   > [info] org.apache.spark.sql.TPCDSQueryTestSuite *** ABORTED *** (1 second, 198 milliseconds)
   > [info]   java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
   > ```
   > 
   > I saw this on other PRs as well. @maropu do you have any clue about it?
   
   Looks like there is something wrong with the GA's DNS.. We encounter the same binding issue in Kyuubi too - https://github.com/yaooqinn/kyuubi/pull/489 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729992086


   **[Test build #131304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131304/testReport)** for PR 24559 at commit [`b3ba28d`](https://github.com/apache/spark/commit/b3ba28d95064f6e532680c88264e8b352c2e3e60).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809864355


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136672/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805307600


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805300507


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136421/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731452723


   Sorry for asking you again, @rdblue . Could you fix the build once more? At this time, CI complains at `Unused import` sadly.
   ```
   [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala:72:68: Unused import
   [error]   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.IdentifierHelper
   [error]                                                                    ^
   [error] one error found
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r527974711



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}
+
+class AggregateFunctionSuite extends SparkFunSuite {

Review comment:
       Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] yaooqinn commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

yaooqinn commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603745517



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Function.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import java.io.Serializable;
+
+/**
+ * Base class for user-defined functions.
+ */
+public interface Function extends Serializable {
+
+  /**
+   * A name to identify this function. Implementations should provide a meaningful name, like the
+   * database and function name from the catalog.
+   */
+  String name();

Review comment:
       Shall we mention that function implementations may be case sensitive or case insensitive？




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732535056


   I myself prefer to have a design doc to prevent that we just go ahead and get lost what we intended to do. I had very long offline/online discussions related to the expression v2 APIs too with @rdblue, @brkyvz, @zero323, etc. We have an incomplete interface there which has been years untouched.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731422974


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131445/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774379653


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134947/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r588786051



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       `PartialAggregateFunction` already have this. Alternatively I'm wondering whether we can do:
   
   ```java
   ByteBuffer serialize(state: S);
   S deserialize(buf: ByteBuffer);
   ```
   following the `TypedImperativeAggregate` interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731426534


   **[Test build #131447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131447/testReport)** for PR 24559 at commit [`b394e8b`](https://github.com/apache/spark/commit/b394e8b2c0c0cf11e572a9b13c763cf71d3ecd11).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774367634


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39530/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805365019


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774368595


   **[Test build #134948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134948/testReport)** for PR 24559 at commit [`8f1084e`](https://github.com/apache/spark/commit/8f1084eba9a122b25e7f0d8b6649a9e77eae30ba).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808516859


   **[Test build #136576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136576/testReport)** for PR 24559 at commit [`70583d8`](https://github.com/apache/spark/commit/70583d8f652537f13a88c3a70417af734ea147d2).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808585216


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41160/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-809791090


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41254/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774368595


   **[Test build #134948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134948/testReport)** for PR 24559 at commit [`8f1084e`](https://github.com/apache/spark/commit/8f1084eba9a122b25e7f0d8b6649a9e77eae30ba).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704585438






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730010862


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131306/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774410185


   **[Test build #134949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134949/testReport)** for PR 24559 at commit [`15e127f`](https://github.com/apache/spark/commit/15e127f67b50f1150a5ec2a5ebf7a89dc5aa4da4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731506170






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730004636


   **[Test build #131305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131305/testReport)** for PR 24559 at commit [`2eec8e5`](https://github.com/apache/spark/commit/2eec8e5b587f780969f5cc6ace0789cacb20db7d).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731482946


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800712542


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136132/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r589253326



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can

Review comment:
       It's very hard to support this, as one aggregate operator may host many aggregate functions that some support partial merging but some do not. I'm +1 to option 1.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731486758


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131454/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731460304






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731523938






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r528938167



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/AggregateFunction.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);

Review comment:
       Is there value for a function in being able to control iteration? And can Spark support it if there is?
   
   I think there could be value for a function like `limit` because the source could stop iteration early. But, I'm not sure what effect that would have on Spark to have an iterator that is not exhausted. Overall, I think there aren't very many cases where controlling iteration in the function has enough value to warrant the additional complexity in the API.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731486753


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731422968






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774380337


   **[Test build #134949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134949/testReport)** for PR 24559 at commit [`15e127f`](https://github.com/apache/spark/commit/15e127f67b50f1150a5ec2a5ebf7a89dc5aa4da4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805353043


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r605231413



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/FunctionCatalog.java
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.sql.catalyst.analysis.NoSuchFunctionException;
+import org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException;
+import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;
+
+/**
+ * Catalog methods for working with Functions.
+ */
+public interface FunctionCatalog extends CatalogPlugin {
+
+  /**
+   * List the functions in a namespace from the catalog.

Review comment:
       Updated.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.sql.types.StructType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r588787875



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       Yes, @sunchao is right. This requirement is added by `PartialAggregateFunction` so merging the two interfaces would pick up that change, if we want to merge the two.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731442994


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36051/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r599968729



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       I don't think it is a good idea to use encoders. That brings in an additional API that is specific to Spark. There's nothing in practice that should prevent using Java's `Serializable` and that's a simple and well-understood way to do what we need.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731423134


   > What do you think of the Transport API?
   
   I think it's great to have people working on APIs for maintaining UDF libraries across projects.
   
   You may be wondering whether I think we should use that to call UDFs. I don't think that we would want to build support for a generic framework into Spark itself. I think Spark's API should be specific to Spark, just like the data source APIs are specific to Spark. That avoids complications like converting to Row or another representation for Hive.
   
   It should be possible to build a library using Transport that plugs in through this API, though. And it is great to have you looking at this and thinking about how it may be limited by the choices here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811596875


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136783/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-811585210


   **[Test build #136783 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136783/testReport)** for PR 24559 at commit [`37439c6`](https://github.com/apache/spark/commit/37439c6069bc82c1944eb44dd2f61712d35a9e9b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814440720


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41543/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774379653


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134947/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu edited a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

maropu edited a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814507882


   > I saw this on other PRs as well. @maropu do you have any clue about it?
   >> Looks like there is something wrong with the GA's DNS.. We encounter the same >> binding issue in Kyuubi too - yaooqinn/kyuubi#489
   
   Yea, it looks like a GA env issue. I saw the error message sometimes in the other PRs.... I think the workaround for now is just to re-run a GA job (but, I will try to find a solution). FYI: @dongjoon-hyun @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813768990


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136919/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603628937



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] xkrogen commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

xkrogen commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r604260231



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/UnboundFunction.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * Represents a user-defined function that is not bound to input types.
+ */
+public interface UnboundFunction extends Function {
+
+  /**
+   * Bind this function to an input type.
+   * <p>
+   * If the input type is not supported, implementations must throw
+   * {@link UnsupportedOperationException}.
+   * <p>
+   * For example, a "length" function that only supports a single string argument should throw
+   * UnsupportedOperationException if the struct has more than one field or if that field is not a
+   * string, and it may optionally throw if the field is nullable.
+   *
+   * @param inputType a struct type for inputs that will be passed to the bound function
+   * @return a function that can process rows with the given input type
+   * @throws UnsupportedOperationException If the function cannot be applied to the input type
+   */
+  BoundFunction bind(StructType inputType);
+
+  /**
+   * Returns Function documentation.

Review comment:
       Minor nit -- the method is called "description", should we say that this returns a description of the function (as opposed to "documentation")?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774370078


   **[Test build #134947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134947/testReport)** for PR 24559 at commit [`86c6ce7`](https://github.com/apache/spark/commit/86c6ce7483b701b43913e43948edf005cfe3dcab).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-729991671






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732425701


   Hi, @cloud-fan and @gatorsmile .
   Are you guys still negative on this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wmoustafa commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

wmoustafa commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-732539259


   > I myself prefer to have a design doc to prevent that we just go ahead and get lost what we intended to do. I had very long offline/online discussions related to the expression v2 APIs too with @rdblue, @brkyvz, @zero323, etc. We have an incomplete interface there which has been years untouched.
   
   Thanks for sharing this. So far Transport is leveraging the Expression API under the hood. Would be great to keep backward compatibility with any new API.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774387534


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800710515


   **[Test build #136132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136132/testReport)** for PR 24559 at commit [`c0e9209`](https://github.com/apache/spark/commit/c0e92092b3cca62d18ef11ad32787a43c949b148).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704583787


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730002474


   **[Test build #131305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131305/testReport)** for PR 24559 at commit [`2eec8e5`](https://github.com/apache/spark/commit/2eec8e5b587f780969f5cc6ace0789cacb20db7d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r587498350



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.IntegerType;
+
+import java.io.Serializable;
+import java.util.UUID;
+
+/**
+ * Represents a function that is bound to an input type.
+ */
+public interface BoundFunction extends Function, Serializable {
+
+  /**
+   * Returns the {@link DataType data type} of values produced by this function.
+   * <p>
+   * For example, a "plus" function may return {@link IntegerType} when it is bound to arguments
+   * that are also {@link IntegerType}.
+   *
+   * @return a data type for values produced by this function
+   */
+  DataType resultType();
+
+  /**
+   * Returns the whether values produced by this function may be null.
+   * <p>
+   * For example, a "plus" function may return false when it is bound to arguments that are always
+   * non-null, but true when either argument may be null.
+   *
+   * @return true if values produced by this function may be null, false otherwise
+   */
+  default boolean isResultNullable() {
+    return true;
+  }
+
+  /**
+   * Returns whether this function result is deterministic.
+   * <p>
+   * By default, functions are assumed to be deterministic. Functions that are not deterministic
+   * should override this method so that Spark can ensure the function runs only once for a given
+   * input.
+   *
+   * @return true if this function is deterministic, false otherwise
+   */
+  default boolean isDeterministic() {
+    return true;
+  }
+
+  /**
+   * Returns the canonical name of this function, used to determine if functions are equivalent.
+   * <p>
+   * The canonical name is used to determine whether two functions are the same when loaded by
+   * different catalogs. For example, the same catalog implementation may be used for by two
+   * environments, "prod" and "test". Functions produced by the catalogs may be equivalent, but
+   * loaded using different names, like "test.func_name" and "prod.func_name".
+   * <p>
+   * Names returned by this function should be unique and unlikely to conflict with similar
+   * functions in other catalogs. For example, many catalogs may define a "bucket" function with a
+   * different implementation. Adding context, like "com.mycompany.bucket(string)", is recommended
+   * to avoid unintentional collisions.
+   *
+   * @return a canonical name for this function
+   */
+  default String canonicalName() {
+    // by default, use a random UUID so a function is never equivalent to another, even itself.
+    // this method is not required so that generated implementations (or careless ones) are not
+    // added and forgotten. for example, returning "" as a place-holder could cause unnecessary
+    // bugs if not replaced before release.
+    return UUID.randomUUID().toString();
+  }

Review comment:
       The function argument type coercion is a very fundamental feature. Do we really need to make it a mixin trait as described in the design doc? How about we always require users to implement an API `DataType[] inputTypes();`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814754136


   Jenkins has passed and the GA failure is unrelated. I'm merging it to master, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731482949


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36060/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-730011179


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35909/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-813761457


   **[Test build #136919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136919/testReport)** for PR 24559 at commit [`37439c6`](https://github.com/apache/spark/commit/37439c6069bc82c1944eb44dd2f61712d35a9e9b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808585216


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41160/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-775349355


   @cloud-fan, @dongjoon-hyun, I'll start a thread for discussion on the dev list. Let's move discussion there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704585438


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-814529276


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136966/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-704581501


   **[Test build #129473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129473/testReport)** for PR 24559 at commit [`21a5f07`](https://github.com/apache/spark/commit/21a5f074e3b564a353da28901c8d6cb107ec04c2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-800712523


   **[Test build #136132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136132/testReport)** for PR 24559 at commit [`c0e9209`](https://github.com/apache/spark/commit/c0e92092b3cca62d18ef11ad32787a43c949b148).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class NoSuchFunctionException(`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774367634


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39530/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r596497034



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * Most implementations should also implement {@link PartialAggregateFunction} so that Spark can
+ * partially aggregate and shuffle intermediate results, instead of shuffling all rows for an
+ * aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the
+ * result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate
+   * aggregation state. More than one intermediate aggregation state variable may be used when the
+   * aggregation is run in parallel tasks.
+   * <p>
+   * The object returned may passed to {@link #update(Object, InternalRow)},
+   * and {@link #produceResult(Object)}. Implementations that return null must support null state
+   * passed into all other methods.
+   *
+   * @return a state instance or null
+   */
+  S newAggregationState();
+
+  /**
+   * Update the aggregation state with a new row.
+   * <p>
+   * This is called for each row in a group to update an intermediate aggregation state.
+   *
+   * @param state intermediate aggregation state
+   * @param input an input row
+   * @return updated aggregation state
+   */
+  S update(S state, InternalRow input);
+
+  /**
+   * Produce the aggregation result based on intermediate state.
+   *
+   * @param state intermediate aggregation state
+   * @return a result value
+   */
+  R produceResult(S state);
+

Review comment:
       One issue I found with the `Serializable` approach is that currently in Spark the `SerializerInstance` as well as `ExpressionEncoder` all require `ClassTag`, which is not available from Java. This makes it hard to reuse the existing machinery in Spark for the serialization/deserialization work. Another issue, which is reflected by the CI failure, is that simple classes such as:
   ```scala
   class IntAverage extends AggregateFunction[(Int, Int), Int]
   ```
   will not work out-of-box, as `(Int, Int)` doesn't implement `Serializable`
   
   The `ClassTag` constraint for `SerializerInstance` was added in #700 for supporting Scala Pickling as one of the serializer implementation but seems the PR never ended in Spark, so not quite sure if it is still needed today. Thanks @viirya for having a offline discussion with me on this.
   
   Because of this, I'm wondering if it makes sense to replace the `Serializable` with something else, such as another method:
   ```java
   Encoder<S> encoder();
   ```
   This can be implemented pretty easily by Spark users with [`Encoders`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala). The approach is similar to the `udaf` API today. For Scala users, we can optionally provide another version of `AggregateFunction` in Scala with implicit, so users don't need to do this.
   
   Would like to hear your opinion on this @rdblue @cloud-fan 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-805267786


   **[Test build #136421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136421/testReport)** for PR 24559 at commit [`42c36d0`](https://github.com/apache/spark/commit/42c36d091a4ed41d208ed171e9a9a53f1a61e47f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774945050


   @rdblue Thanks for writing up the design doc! This is a very important and useful feature, and the `UnboundFunction` seems like a very interesting idea. It allows function overload (for different input schema, people can return different `BoundFunction`), but I'm wondering how it can suggest Spark to add Cast. For example, if a function accepts int type input, but the actual input is byte type.
   
   Another point is we should think of the final generated java code when invoking UDF. With whole-stage-codegen (the default case), the input values are actually java variables in the generated java code. It means we need to build an `InternalRow` before invoking the new UDF, which is very inefficient and is even worse than the current Spark Scala/Java UDF. Also, the type parameter of the return type has perf issues because of primitive type boxing.
   
   My rough idea is
   ```
   interface ScalarFunction {
     StructType[] expectedInputTypes();
     DataType returnType();
   }
   
   class MyScalaFunction implements ScalarFunction {
     StructType[] expectedInputTypes() { // ... allows int and string }
     DataType returnType() { return IntegerType; }
   
     int call(int arg) { return String.valueOf(arg).length(); }
     int call(UTF8String arg) { return arg.length(); }
   }
   ```
   The analyzer will bind the UDF with actual input types (add implicit cast if needed), and check if the `call` method exits
    for certain input/return types via reflection. Then in whole-stage-codegen, we just call the `call` method with certain type of inputs, and assign the result to a java variable. No need to build `InternalRow`, no boxing overhead, but no compile-time type safety (analyzer can still catch errors).
   
   cc @viirya @maropu @kiszk @rednaxelafx 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731477816


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36060/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731422968


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #24559:
URL: https://github.com/apache/spark/pull/24559#discussion_r603629652



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.functions;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+
+import java.io.Serializable;
+
+/**
+ * Interface for a function that produces a result value by aggregating over multiple input rows.
+ * <p>
+ * For each input row, Spark will call an update method that corresponds to the
+ * {@link #inputTypes() input data types}. The expected JVM argument types must be the types used by
+ * Spark's InternalRow API. If no direct method is found or when not using codegen, Spark will call
+ * update with {@link InternalRow}.
+ * <p>
+ * The JVM type of result values produced by this function must be the type used by Spark's
+ * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * All implementations must support partial aggregation by implementing {@link #merge(S, S)} so
+ * that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all
+ * rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to
+ * produce the result.
+ * <p>
+ * Intermediate aggregation state must be {@link Serializable} so that state produced by parallel
+ * tasks can be sent to a single executor and merged to produce a final result.
+ *
+ * @param <S> the JVM type for the aggregation's intermediate state; must be {@link Serializable}
+ * @param <R> the JVM type of result values
+ */
+public interface AggregateFunction<S extends Serializable, R> extends BoundFunction {
+
+  /**
+   * Initialize state for an aggregation.
+   * <p>
+   * This method is called one or more times for every group of values to initialize intermediate

Review comment:
       This may be called twice for the same group if the group appears on multiple executors before the shuffle.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-731482946






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-808516859


   **[Test build #136576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136576/testReport)** for PR 24559 at commit [`70583d8`](https://github.com/apache/spark/commit/70583d8f652537f13a88c3a70417af734ea147d2).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org