Posted to commits@spark.apache.org by do...@apache.org on 2022/09/27 09:17:16 UTC

[spark] branch branch-3.2 updated: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new dc9041a9df0 [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
dc9041a9df0 is described below

commit dc9041a9df0f2f338de1c06c1485a0ab70ff3e16
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue Sep 27 02:14:22 2022 -0700

    [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
    
    This PR aims to add a new legacy configuration that keeps the `grouping__id` values identical to those of the released Apache Spark 3.2 and 3.3.
    
    Please note that this syntax is not SQL-standard, and even Hive doesn't support it.
    
    ```SQL
    hive> SELECT version();
    OK
    3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975
    Time taken: 0.111 seconds, Fetched: 1 row(s)
    
    hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b);
    FAILED: SemanticException 1:63 [Error 10213]:
    Grouping sets expression is not in GROUP BY key. Error encountered near token 'b'
    ```
    
    SPARK-40218 fixed a bug introduced by SPARK-34932 (in Apache Spark 3.2.0). As a side effect, the `grouping__id` values changed; a sketch of the bit layout follows the examples below.
    
    - Before SPARK-40218 (Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.3.0)
    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           1|
    |       1|           1|
    +--------+------------+
    ```
    
    - After SPARK-40218 (Apache Spark 3.4.0, 3.3.1, 3.2.3)
    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           2|
    |       1|           2|
    +--------+------------+
    ```
    
    - This PR (Apache Spark 3.4.0, 3.3.1, 3.2.3)
    ```scala
    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           2|
    |       1|           2|
    +--------+------------+
    
    scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true")
    res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

    scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
    +--------+------------+
    |count(1)|grouping__id|
    +--------+------------+
    |       1|           1|
    |       1|           1|
    +--------+------------+
    ```
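
    The value change follows from the bit layout of `grouping__id`: one bit per final grouping column, with the leftmost column in the most significant position, and a bit set to 1 when that column is aggregated away in the current grouping set. A minimal sketch of that calculation (the `groupingId` helper below is hypothetical and only illustrates the assumed bit layout; it is not part of this patch):

    ```scala
    // Hypothetical helper: derive the bitmask from the grouping-column order.
    // Leftmost column = most significant bit; bit = 1 when the column is absent from the grouping set.
    def groupingId(columns: Seq[String], groupingSet: Set[String]): Int =
      columns.foldLeft(0) { (id, c) => (id << 1) | (if (groupingSet.contains(c)) 0 else 1) }

    // Current order: user-given GROUP BY expressions first, then grouping-set columns.
    groupingId(Seq("k1", "k2"), Set("k2"))   // 2, matching Spark 3.2.3, 3.3.1, and 3.4.0
    // Legacy order: grouping-set columns first, then the appended user GROUP BY expressions.
    groupingId(Seq("k2", "k1"), Set("k2"))   // 1, matching Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0
    ```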
    
    No. This simply adds back the previous behavior behind the legacy configuration.
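
    A session-level sketch of restoring the legacy values (assuming a standard local SparkSession setup; the app name and master are placeholders):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Enable the legacy flag when building the session instead of via a runtime SET command.
    val spark = SparkSession.builder()
      .appName("legacy-grouping-id")
      .master("local[*]")
      .config("spark.sql.legacy.groupingIdWithAppendedUserGroupBy", "true")
      .getOrCreate()

    spark.sql(
      "SELECT count(*), grouping__id FROM (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) " +
      "GROUP BY k1 GROUPING SETS (k2)").show()   // grouping__id is 1 again with the flag set
    ```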
    
    Pass the CIs.
    
    Closes #38001 from dongjoon-hyun/SPARK-40562.
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
    (cherry picked from commit 5c0ebf3d97ae49b6e2bd2096c2d590abf4d725bd)
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
 docs/sql-migration-guide.md                            |  2 ++
 .../spark/sql/catalyst/expressions/grouping.scala      | 18 ++++++++++++++----
 .../scala/org/apache/spark/sql/internal/SQLConf.scala  | 12 ++++++++++++
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index da132544780..93d45cdc408 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -105,6 +105,8 @@ license: |
 
   - In Spark 3.2, date +/- interval with only day-time fields such as `date '2011-11-11' + interval 12 hours` returns timestamp. In Spark 3.1 and earlier, the same expression returns date. To restore the behavior before Spark 3.2, you can use `cast` to convert timestamp as date.
 
+  - Since Spark 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, and 3.2.2. It is computed based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy` to `true`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+
 ## Upgrading from Spark SQL 3.0 to 3.1
 
  - In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
index 22e25b31f2e..8856a3fe94b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
@@ -154,12 +154,22 @@ case class GroupingSets(
   // Note that, we must put `userGivenGroupByExprs` at the beginning, to preserve the order of
   // grouping columns specified by users. For example, GROUP BY (a, b) GROUPING SETS (b, a), the
   // final grouping columns should be (a, b).
-  override def children: Seq[Expression] = userGivenGroupByExprs ++ flatGroupingSets
+  override def children: Seq[Expression] =
+    if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+      flatGroupingSets ++ userGivenGroupByExprs
+    } else {
+      userGivenGroupByExprs ++ flatGroupingSets
+    }
+
   override protected def withNewChildrenInternal(
       newChildren: IndexedSeq[Expression]): GroupingSets =
-    copy(
-      userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
-      flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+    if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+      super.legacyWithNewChildren(newChildren).asInstanceOf[GroupingSets]
+    } else {
+      copy(
+        userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
+        flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+    }
 }
 
 object GroupingSets {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 33765619823..b28bfeee245 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3154,6 +3154,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+   val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+        "user-given group-by expressions order like Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0.")
+      .version("3.2.3")
+      .booleanConf
+      .createWithDefault(false)
+
   val PARQUET_INT96_REBASE_MODE_IN_WRITE =
     buildConf("spark.sql.parquet.int96RebaseModeInWrite")
       .internal()
@@ -4109,6 +4118,9 @@ class SQLConf extends Serializable with Logging {
 
   def integerGroupingIdEnabled: Boolean = getConf(SQLConf.LEGACY_INTEGER_GROUPING_ID)
 
+  def groupingIdWithAppendedUserGroupByEnabled: Boolean =
+    getConf(SQLConf.LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY)
+
   def metadataCacheTTL: Long = getConf(StaticSQLConf.METADATA_CACHE_TTL_SECONDS)
 
   def coalesceBucketsInJoinEnabled: Boolean = getConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED)

