You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/28 10:29:34 UTC

[GitHub] [spark] beliefer opened a new pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: percentile_disc

beliefer opened a new pull request #35041:
URL: https://github.com/apache/spark/pull/35041


   ### What changes were proposed in this pull request?
   `PERCENTILE_CONT` is an ANSI aggregate functions.
   
   The mainstream database supports `percentile_cont` show below:
   **Postgresql**
   https://www.postgresql.org/docs/9.4/functions-aggregate.html
   **Teradata**
   https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/cPkFySIBORL~M938Zv07Cg
   **Snowflake**
   https://docs.snowflake.com/en/sql-reference/functions/percentile_cont.html
   **Oracle**
   https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/PERCENTILE_CONT.html#GUID-CA259452-A565-41B3-A4F4-DD74B66CEDE0
   **H2**
   http://www.h2database.com/html/functions-aggregate.html#percentile_cont
   **Sybase**
   https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01776.1601/doc/html/san1278453109663.html
   **Exasol**
   https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/percentile_cont.htm
   **RedShift**
   https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html
   **Yellowbrick**
   https://www.yellowbrick.com/docs/2.2/ybd_sqlref/percentile_cont.html
   **Mariadb**
   https://mariadb.com/kb/en/percentile_cont/
   **Phoenix**
   http://phoenix.incubator.apache.org/language/functions.html#percentile_cont
   **Singlestore**
   https://docs.singlestore.com/db/v7.6/en/reference/sql-reference/window-functions/percentile_cont-and-median.html
   
   ### Why are the changes needed?
   `PERCENTILE_CONT` is very useful. Exposing the expression can make the migration from other systems to Spark SQL easier.
   
   ### Does this PR introduce _any_ user-facing change?
   'Yes'. New feature.
   
   
   ### How was this patch tested?
   New tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1081308706


   ping @MaxGekk again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a change in pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35041:
URL: https://github.com/apache/spark/pull/35041#discussion_r820026108



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
##########
@@ -1301,7 +1301,7 @@ class PlanParserSuite extends AnalysisTest {
       "timestamp expression cannot contain subqueries")
   }
 
-  test("PERCENTILE_CONT function") {
+  test("PERCENTILE_CONT & PERCENTILE_DISC") {
     def assertPercentileContPlans(inputSQL: String, expectedExpression: Expression): Unit = {

Review comment:
       How about `assertPercentilePlans` ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35041:
URL: https://github.com/apache/spark/pull/35041#discussion_r821441631



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala
##########
@@ -207,41 +240,49 @@ class PercentileSuite extends SparkFunSuite {
       CreateArray(Seq(0, 0.5, 1).map(Literal(_))))
 
     validPercentages.foreach { percentage =>
-      val percentile1 = new Percentile(child, percentage)
-      assertEqual(percentile1.checkInputDataTypes(), TypeCheckSuccess)
+      Seq(new Percentile(child, percentage),
+        new PercentileDisc(child, percentage)).foreach { percentile =>
+        assertEqual(percentile.checkInputDataTypes(), TypeCheckSuccess)
+      }
     }
 
     val invalidPercentages = Seq(Literal(-0.5), Literal(1.5), Literal(2D),
       CreateArray(Seq(-0.5, 0, 2).map(Literal(_))))
 
     invalidPercentages.foreach { percentage =>
-      val percentile2 = new Percentile(child, percentage)
-      assertEqual(percentile2.checkInputDataTypes(),
-        TypeCheckFailure(s"Percentage(s) must be between 0.0 and 1.0, " +
-        s"but got ${percentage.simpleString(100)}"))
+      Seq(new Percentile(child, percentage),
+        new PercentileDisc(child, percentage)).foreach { percentile =>

Review comment:
       We can have a small method like
   ```
   def withPercentiles(child: Expression, percentage)(f: PercentileBase => Unit) {
     Seq(new Percentile(child, percentage), ...).foreach...
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1059718815


   > The changes looks good, need extra time to review the `getPercentile` function.
   > 
   > Can you please improvement the test coverage by refactoring the test suite `PercentileSuite` ? Thanks!
   
   `PercentileSuite` has been updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1005619977


   @MaxGekk Could you take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1029827038


   @jiangxb1987 Could you help to review the changes, please.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: percentile_disc

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1002456224


   ping @MaxGekk @gengliangwang cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35041:
URL: https://github.com/apache/spark/pull/35041#discussion_r821440019



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala
##########
@@ -324,3 +337,117 @@ case class Percentile(
     frequencyExpression = newThird
   )
 }
+
+/**
+ * The Percentile aggregate function returns the percentile(s) based on a discrete distribution of
+ * numeric column `expr` at the given percentage(s) with value range in [0.0, 1.0].
+ *
+ * Because the number of elements and their partial order cannot be determined in advance.
+ * Therefore we have to store all the elements in memory, and so notice that too many elements can
+ * cause GC paused and eventually OutOfMemory Errors.
+ */

Review comment:
       We need to write SQL function doc for it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer edited a comment on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer edited a comment on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1081308706


   ping @MaxGekk @jiangxb1987  again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1079548540


   > @beliefer Could you resolve conflicts, please.
   
   OK


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a change in pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on a change in pull request #35041:
URL: https://github.com/apache/spark/pull/35041#discussion_r822329479



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala
##########
@@ -324,3 +337,117 @@ case class Percentile(
     frequencyExpression = newThird
   )
 }
+
+/**
+ * The Percentile aggregate function returns the percentile(s) based on a discrete distribution of
+ * numeric column `expr` at the given percentage(s) with value range in [0.0, 1.0].
+ *
+ * Because the number of elements and their partial order cannot be determined in advance.
+ * Therefore we have to store all the elements in memory, and so notice that too many elements can
+ * cause GC paused and eventually OutOfMemory Errors.
+ */

Review comment:
       OK




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1022751663


   ping @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35041:
URL: https://github.com/apache/spark/pull/35041#issuecomment-1057747474


   @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jiangxb1987 commented on a change in pull request #35041: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on a change in pull request #35041:
URL: https://github.com/apache/spark/pull/35041#discussion_r819839061



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
##########
@@ -1301,7 +1301,7 @@ class PlanParserSuite extends AnalysisTest {
       "timestamp expression cannot contain subqueries")
   }
 
-  test("PERCENTILE_CONT function") {
+  test("PERCENTILE_CONT & PERCENTILE_DISC") {
     def assertPercentileContPlans(inputSQL: String, expectedExpression: Expression): Unit = {

Review comment:
       nit: rename this function to `assertPercentileBasePlans` ?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala
##########
@@ -324,3 +337,100 @@ case class Percentile(
     frequencyExpression = newThird
   )
 }
+
+/**
+ * The Percentile aggregate function returns a percentile(s) based on a discrete distribution of

Review comment:
       `a percentile(s)` -> `the percentile(s)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org