Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/15 12:14:05 UTC

[GitHub] [spark] beliefer opened a new pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

beliefer opened a new pull request #35531:
URL: https://github.com/apache/spark/pull/35531


   ### What changes were proposed in this pull request?
   https://github.com/apache/spark/pull/34936 added support for the ANSI aggregate function `percentile_cont`.
   Many mainstream databases support `percentile_cont` as a window function, as shown by the links below; a sketch of the expected Spark SQL usage follows the list:
   
   **Snowflake**
   https://docs.snowflake.com/en/sql-reference/functions/percentile_cont.html
   **Oracle**
   https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/PERCENTILE_CONT.html#GUID-CA259452-A565-41B3-A4F4-DD74B66CEDE0
   **H2**
   http://www.h2database.com/html/functions-aggregate.html#percentile_cont
   **Sybase**
   https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01776.1601/doc/html/san1278453109663.html
   **Exasol**
   https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/percentile_cont.htm
   **Yellowbrick**
   https://www.yellowbrick.com/docs/2.2/ybd_sqlref/percentile_cont_window.html
   **Mariadb**
   https://mariadb.com/kb/en/percentile_cont/
   **Singlestore**
   https://docs.singlestore.com/db/v7.6/en/reference/sql-reference/window-functions/percentile_cont-and-median.html
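
   A minimal, hedged sketch of the expected usage in Spark SQL (the table `sales(dept, price)` is hypothetical and not part of the PR's tests):
   ```
   -- Aggregate form, already supported by https://github.com/apache/spark/pull/34936:
   SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) FROM sales;

   -- Window form enabled by this PR; the window specification is expected to
   -- contain only PARTITION BY:
   SELECT dept,
          percentile_cont(0.5) WITHIN GROUP (ORDER BY price)
            OVER (PARTITION BY dept) AS median_price
   FROM sales;
   ```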
   
   
   ### Why are the changes needed?
   Support the ANSI aggregation function `percentile_cont` as a window function.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'. New feature.
   
   
   ### How was this patch tested?
   New tests.
   




[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r822348692



##########
File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##########
@@ -890,7 +890,7 @@ primaryExpression
     | OVERLAY '(' input=valueExpression PLACING replace=valueExpression
       FROM position=valueExpression (FOR length=valueExpression)? ')'                          #overlay
     | PERCENTILE_CONT '(' percentage=valueExpression ')'
-      WITHIN GROUP '(' ORDER BY sortItem ')'                                                   #percentile
+      WITHIN GROUP '(' ORDER BY sortItem ')' ( OVER windowSpec)?                               #percentile

Review comment:
       According to the documentation of these databases, almost all of them state that the window specification of `PERCENTILE_CONT` may only contain PARTITION BY.
    I can't find any further explanation of the root cause or rationale.
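
    To make the restriction concrete, here is a hedged sketch (hypothetical table `t(k, x)`, not taken from the PR) of what is expected to be accepted versus rejected:
   ```
   -- Expected to be accepted: the window specification contains only PARTITION BY
   SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY k) FROM t;

   -- Expected to be rejected: an ORDER BY inside the window specification
   SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY k ORDER BY x) FROM t;
   ```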






[GitHub] [spark] amaliujia commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
amaliujia commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1045143504


   I know it might be painful to ask why we are doing this now when we did not in the past, but having the function specification documented and discussed before the code review would be a really helpful process for changing a user-facing SQL feature (either adding a new one or changing an existing one). Many SQL-related projects do it (e.g. Apache Calcite).
   
   It is not that difficult: just write down the function signature, argument types, behaviors, and examples. It is more like documenting what you are expressing in your code. People can then see all the key decisions, argue about specific ones, or catch what has been missed.




[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r816384851



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -1835,11 +1835,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
   override def visitPercentile(ctx: PercentileContext): Expression = withOrigin(ctx) {
     val percentage = expression(ctx.percentage)
     val sortOrder = visitSortItem(ctx.sortItem)
-    val percentile = sortOrder.direction match {
-      case Ascending => new Percentile(sortOrder.child, percentage)
-      case Descending => new Percentile(sortOrder.child, Subtract(Literal(1), percentage))

Review comment:
       Yes. According to the documentation, only the PARTITION BY clause can be specified in the window.
   I tested in Oracle:
   ```
   SELECT 
   percentile_cont(0.25) WITHIN GROUP (ORDER BY x) OVER (ORDER BY x ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
   percentile_cont(0.25) WITHIN GROUP (ORDER BY x DESC) OVER (PARTITION BY k)
   FROM SYSTEM.testRegression
   ORDER BY k;
   ```
   and it throws `SQL error [30487] [99999]: ORA-30487: ORDER BY not allowed here`.
   






[GitHub] [spark] MaxGekk commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1079335216


   +1, LGTM. Merged to master.
   Thank you, @beliefer and @cloud-fan for review.




[GitHub] [spark] MaxGekk closed pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
MaxGekk closed pull request #35531:
URL: https://github.com/apache/spark/pull/35531


   




[GitHub] [spark] MaxGekk commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1078950951


   Waiting for https://github.com/beliefer/spark/runs/5691324331 to complete.




[GitHub] [spark] cloud-fan commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r814827181



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -1835,11 +1835,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
   override def visitPercentile(ctx: PercentileContext): Expression = withOrigin(ctx) {
     val percentage = expression(ctx.percentage)
     val sortOrder = visitSortItem(ctx.sortItem)
-    val percentile = sortOrder.direction match {
-      case Ascending => new Percentile(sortOrder.child, percentage)
-      case Descending => new Percentile(sortOrder.child, Subtract(Literal(1), percentage))

Review comment:
       Can we directly use `Percentile`? It's an aggregate function and can be used in a window.






[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r837392391



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileCont.scala
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, RuntimeReplaceableAggregate}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.types.AbstractDataType
+
+/**
+ * Return a percentile value based on a continuous distribution of
+ * numeric or ansi interval column at the given percentage (specified in ORDER BY clause).
+ * The value of percentage must be between 0.0 and 1.0.
+ */
+case class PercentileCont(left: Expression, right: Expression)
+  extends AggregateFunction
+  with RuntimeReplaceableAggregate
+  with ImplicitCastInputTypes

Review comment:
       We will use a tree node tag to improve the implementation of `PercentileCont`, based on an offline discussion.






[GitHub] [spark] MaxGekk commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1079083970


   The test failure is not related to the changes:
   ```
   [info] - SPARK-37555: spark-sql should pass last unclosed comment to backend *** FAILED *** (2 minutes, 11 seconds)
   ...
   [info]   Exception: java.util.concurrent.TimeoutException: Futures timed out after [2 minutes]
   ```




[GitHub] [spark] beliefer commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1045629336


   > I know it might be painful to ask why we are doing this now when we did not in the past, but having the function specification documented and discussed before the code review would be a really helpful process for changing a user-facing SQL feature (either adding a new one or changing an existing one). Many SQL-related projects do it (e.g. Apache Calcite).
   > 
   > It is not that difficult: just write down the function signature, argument types, behaviors, and examples. It is more like documenting what you are expressing in your code. People can then see all the key decisions, argue about specific ones, or catch what has been missed.
   
   Updated




[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r816384851



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -1835,11 +1835,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
   override def visitPercentile(ctx: PercentileContext): Expression = withOrigin(ctx) {
     val percentage = expression(ctx.percentage)
     val sortOrder = visitSortItem(ctx.sortItem)
-    val percentile = sortOrder.direction match {
-      case Ascending => new Percentile(sortOrder.child, percentage)
-      case Descending => new Percentile(sortOrder.child, Subtract(Literal(1), percentage))

Review comment:
       Yes. According to the documentation, only the PARTITION BY clause can be specified in the window.
   I tested Oracle, and it throws `SQL error [30487] [99999]: ORA-30487: ORDER BY not allowed here`.
   






[GitHub] [spark] cloud-fan commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r821677819



##########
File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##########
@@ -890,7 +890,7 @@ primaryExpression
     | OVERLAY '(' input=valueExpression PLACING replace=valueExpression
       FROM position=valueExpression (FOR length=valueExpression)? ')'                          #overlay
     | PERCENTILE_CONT '(' percentage=valueExpression ')'
-      WITHIN GROUP '(' ORDER BY sortItem ')'                                                   #percentile
+      WITHIN GROUP '(' ORDER BY sortItem ')' ( OVER windowSpec)?                               #percentile

Review comment:
       And why can't users specify the window frame boundaries?






[GitHub] [spark] beliefer commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1050816749


   ping @MaxGekk cc @cloud-fan @gengliangwang 




[GitHub] [spark] cloud-fan commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r821676658



##########
File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##########
@@ -890,7 +890,7 @@ primaryExpression
     | OVERLAY '(' input=valueExpression PLACING replace=valueExpression
       FROM position=valueExpression (FOR length=valueExpression)? ')'                          #overlay
     | PERCENTILE_CONT '(' percentage=valueExpression ')'
-      WITHIN GROUP '(' ORDER BY sortItem ')'                                                   #percentile
+      WITHIN GROUP '(' ORDER BY sortItem ')' ( OVER windowSpec)?                               #percentile

Review comment:
       I'd like to understand the rationale better. This function has its own syntax to specify ORDER BY, so I get that it should not co-exist with the ORDER BY in the window frame. But doesn't `percentile` have the same problem when being used as a window function?






[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r835094439



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileCont.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes, RuntimeReplaceableAggregate}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.types.AbstractDataType
+
+/**
+ * Return a percentile value based on a continuous distribution of
+ * the input column (specified in ORDER BY clause).
+ */
+@ExpressionDescription(

Review comment:
       I have removed it.






[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r835090963



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileCont.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes, RuntimeReplaceableAggregate}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.types.AbstractDataType
+
+/**
+ * Return a percentile value based on a continuous distribution of
+ * the input column (specified in ORDER BY clause).
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(col, percentage) - Returns a percentile value based on a continuous distribution of
+     numeric or ansi interval column `col` at the given percentage. The value of percentage must be
+     between 0.0 and 1.0.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 0.3) FROM VALUES (0), (10) AS tab(col);
+       3.0
+      > SELECT _FUNC_(col, 0.5) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col);
+       5.0
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")

Review comment:
       OK






[GitHub] [spark] cloud-fan commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1070355796


   @MaxGekk 




[GitHub] [spark] cloud-fan commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r815961560



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -1835,11 +1835,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
   override def visitPercentile(ctx: PercentileContext): Expression = withOrigin(ctx) {
     val percentage = expression(ctx.percentage)
     val sortOrder = visitSortItem(ctx.sortItem)
-    val percentile = sortOrder.direction match {
-      case Ascending => new Percentile(sortOrder.child, percentage)
-      case Descending => new Percentile(sortOrder.child, Subtract(Literal(1), percentage))

Review comment:
       Is this a real limitation, or is it just that other databases do not support it?






[GitHub] [spark] beliefer commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r815258699



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -1835,11 +1835,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
   override def visitPercentile(ctx: PercentileContext): Expression = withOrigin(ctx) {
     val percentage = expression(ctx.percentage)
     val sortOrder = visitSortItem(ctx.sortItem)
-    val percentile = sortOrder.direction match {
-      case Ascending => new Percentile(sortOrder.child, percentage)
-      case Descending => new Percentile(sortOrder.child, Subtract(Literal(1), percentage))

Review comment:
       Because ORDER BY and the window frame cannot be specified for 'PERCENTILE_CONT':
   ```
                 case AggregateExpression(_: PercentileCont, _, _, _, _)
                    if w.windowSpec.orderSpec.nonEmpty || w.windowSpec.frameSpecification !=
                        SpecifiedWindowFrame(RowFrame, UnboundedPreceding, UnboundedFollowing) =>
                    failAnalysis("Cannot specify order by or frame for 'PERCENTILE_CONT'.")
   ```
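
   For example, under the check above a query like the following (hypothetical table `t(k, x)`, not from the PR's tests) should fail analysis with that message because it specifies a window frame:
   ```
   SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY x)
     OVER (PARTITION BY k ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
   FROM t;
   -- expected error: Cannot specify order by or frame for 'PERCENTILE_CONT'.
   ```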






[GitHub] [spark] beliefer commented on pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35531:
URL: https://github.com/apache/spark/pull/35531#issuecomment-1079542467


   @MaxGekk @cloud-fan Thank you for the help.




[GitHub] [spark] MaxGekk commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r835054964



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileCont.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes, RuntimeReplaceableAggregate}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.types.AbstractDataType
+
+/**
+ * Return a percentile value based on a continuous distribution of
+ * the input column (specified in ORDER BY clause).
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(col, percentage) - Returns a percentile value based on a continuous distribution of
+     numeric or ansi interval column `col` at the given percentage. The value of percentage must be
+     between 0.0 and 1.0.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 0.3) FROM VALUES (0), (10) AS tab(col);
+       3.0
+      > SELECT _FUNC_(col, 0.5) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col);
+       5.0
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")

Review comment:
       3.3.0 -> 3.4.0
   SPARK-38219 is not in the allow list https://lists.apache.org/thread/zrd7lcm5f5f3md7wffjy7x6w2pdmxxp7

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileCont.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes, RuntimeReplaceableAggregate}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.types.AbstractDataType
+
+/**
+ * Return a percentile value based on a continuous distribution of
+ * the input column (specified in ORDER BY clause).
+ */
+@ExpressionDescription(

Review comment:
       If the expression is not bound to a function in `FunctionRegistry`, `ExpressionDescription` is useless. What's the purpose of the annotation?






[GitHub] [spark] cloud-fan commented on a change in pull request #35531: [SPARK-38219][SQL] Support ANSI aggregation function `percentile_cont` as window function

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35531:
URL: https://github.com/apache/spark/pull/35531#discussion_r837196858



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileCont.scala
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, RuntimeReplaceableAggregate}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.types.AbstractDataType
+
+/**
+ * Return a percentile value based on a continuous distribution of
+ * numeric or ansi interval column at the given percentage (specified in ORDER BY clause).
+ * The value of percentage must be between 0.0 and 1.0.
+ */
+case class PercentileCont(left: Expression, right: Expression)
+  extends AggregateFunction
+  with RuntimeReplaceableAggregate
+  with ImplicitCastInputTypes

Review comment:
       Since we only use `PercentileCont` as a marker to indicate that window frame ORDER BY should not be used, I think `InheritAnalysisRules` is better here.



