You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/12/18 07:55:39 UTC

[PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

beliefer opened a new pull request, #44397:
URL: https://github.com/apache/spark/pull/44397

   ### What changes were proposed in this pull request?
   This PR will translate the aggregate function `PERCENTILE_CONT` and `PERCENTILE_DISC` for pushdown.
   
   This PR adds `Expression[] orderingWithinGroups` into `GeneralAggregateFunc`, so as DS V2 pushdown framework could compile the `WITHIN GROUP (ORDER BY ...)` easily.
   This PR also split `visitInverseDistributionFunction` from `visitAggregateFunction`, so as DS V2 pushdown framework could generate the syntax `WITHIN GROUP (ORDER BY ...)` easily.
   This PR also fix a but that `JdbcUtils` can't treat the precision and scale of decimal returned from JDBC.
   
   
   ### Why are the changes needed?
   DS V2 supports push down `Mode`
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature.
   
   
   ### How was this patch tested?
   New test cases.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #44397:
URL: https://github.com/apache/spark/pull/44397#issuecomment-1877089070

   ping @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #44397:
URL: https://github.com/apache/spark/pull/44397#issuecomment-1884182412

   @cloud-fan Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1429626009


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -402,18 +402,11 @@ object JdbcUtils extends Logging with SQLConfHelper {
           row.update(pos, null)
         }
 
-    // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
-    // object returned by ResultSet.getBigDecimal is not correctly matched to the table
-    // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
-    // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
-    // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
-    // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
-    // retrieve it, you will get wrong result 199.99.
-    // So it is needed to set precision and scale for Decimal based on JDBC metadata.
     case DecimalType.Fixed(p, s) =>
       (rs: ResultSet, row: InternalRow, pos: Int) =>
         val decimal =
-          nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))

Review Comment:
   The origin code will throws exception.
   ```
   Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
   	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
   	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
   	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
   	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
   	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
   	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
   	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
   	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
   	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1444003277


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##########
@@ -271,6 +279,15 @@ protected String visitAggregateFunction(
     }
   }
 
+  protected String visitInverseDistributionFunction(
+      String funcName, boolean isDistinct, String[] inputs, String[] orderingWithinGroups) {
+    assert(isDistinct == false);
+    String withinGroup =
+      joinArrayToString(orderingWithinGroups, ", ", "WITHIN GROUP (ORDER BY ", ")");

Review Comment:
   Please refer `visitSortOrder`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1446073030


##########
sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala:
##########
@@ -42,7 +42,7 @@ private[sql] object H2Dialect extends JdbcDialect {
 
   private val distinctUnsupportedAggregateFunctions =
     Set("COVAR_POP", "COVAR_SAMP", "CORR", "REGR_INTERCEPT", "REGR_R2", "REGR_SLOPE", "REGR_SXY",
-      "MODE")
+      "MODE", "PERCENTILE_CONT", "PERCENTILE_DISC")

Review Comment:
   Because H2 dialect overrides the `visitInverseDistributionFunction` and check with `isSupportedFunction`.
   
   ```
        override def visitInverseDistributionFunction(
            funcName: String,
            isDistinct: Boolean,
            inputs: Array[String],
            orderingWithinGroups: Array[String]): String = {
          if (isSupportedFunction(funcName)) {
            super.visitInverseDistributionFunction(
              dialectFunctionName(funcName), isDistinct, inputs, orderingWithinGroups)
          } else {
            throw new UnsupportedOperationException(
              s"${this.getClass.getSimpleName} does not support " +
                s"inverse distribution function: $funcName")
          }
        }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1443047767


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java:
##########
@@ -51,11 +53,21 @@ public final class GeneralAggregateFunc extends ExpressionWithToString implement
   private final String name;
   private final boolean isDistinct;
   private final Expression[] children;
+  private final Expression[] orderingWithinGroups;
 
   public GeneralAggregateFunc(String name, boolean isDistinct, Expression[] children) {
     this.name = name;
     this.isDistinct = isDistinct;
     this.children = children;
+    this.orderingWithinGroups = null;

Review Comment:
   empty array is a better default



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #44397:
URL: https://github.com/apache/spark/pull/44397#issuecomment-1867281469

   ping @cloud-fan cc @huaxingao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1445973126


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java:
##########
@@ -51,11 +53,21 @@ public final class GeneralAggregateFunc extends ExpressionWithToString implement
   private final String name;
   private final boolean isDistinct;
   private final Expression[] children;
+  private final Expression[] orderingWithinGroups;

Review Comment:
   shall we use `SortOrder[]`?



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java:
##########
@@ -64,6 +76,8 @@ public GeneralAggregateFunc(String name, boolean isDistinct, Expression[] childr
   @Override
   public Expression[] children() { return children; }
 
+  public Expression[] orderingWithinGroups() { return orderingWithinGroups; }

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1445975310


##########
sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala:
##########
@@ -42,7 +42,7 @@ private[sql] object H2Dialect extends JdbcDialect {
 
   private val distinctUnsupportedAggregateFunctions =
     Set("COVAR_POP", "COVAR_SAMP", "CORR", "REGR_INTERCEPT", "REGR_R2", "REGR_SLOPE", "REGR_SXY",
-      "MODE")
+      "MODE", "PERCENTILE_CONT", "PERCENTILE_DISC")

Review Comment:
   why do we need this change? H2 dialect deals with these two functions in `visitInverseDistributionFunction`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1443049206


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##########
@@ -271,6 +279,15 @@ protected String visitAggregateFunction(
     }
   }
 
+  protected String visitInverseDistributionFunction(
+      String funcName, boolean isDistinct, String[] inputs, String[] orderingWithinGroups) {
+    assert(isDistinct == false);
+    String withinGroup =
+      joinArrayToString(orderingWithinGroups, ", ", "WITHIN GROUP (ORDER BY ", ")");

Review Comment:
   how do we translate ASC/DESC?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44397:
URL: https://github.com/apache/spark/pull/44397#issuecomment-1884169996

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #44397: [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC
URL: https://github.com/apache/spark/pull/44397


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #44397:
URL: https://github.com/apache/spark/pull/44397#issuecomment-1882394605

   The GA failure is unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44397:
URL: https://github.com/apache/spark/pull/44397#discussion_r1429631348


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -402,18 +402,11 @@ object JdbcUtils extends Logging with SQLConfHelper {
           row.update(pos, null)
         }
 
-    // When connecting with Oracle DB through JDBC, the precision and scale of BigDecimal
-    // object returned by ResultSet.getBigDecimal is not correctly matched to the table
-    // schema reported by ResultSetMetaData.getPrecision and ResultSetMetaData.getScale.
-    // If inserting values like 19999 into a column with NUMBER(12, 2) type, you get through
-    // a BigDecimal object with scale as 0. But the dataframe schema has correct type as
-    // DecimalType(12, 2). Thus, after saving the dataframe into parquet file and then
-    // retrieve it, you will get wrong result 199.99.
-    // So it is needed to set precision and scale for Decimal based on JDBC metadata.
     case DecimalType.Fixed(p, s) =>
       (rs: ResultSet, row: InternalRow, pos: Int) =>
         val decimal =
-          nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))

Review Comment:
   The precision is 38 and scale is 38 based on `DecimalType`.
   In fact, the decimal return from JDBC is `BigDecimal(7, 3)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org