Posted to commits@spark.apache.org by do...@apache.org on 2022/09/27 09:17:16 UTC
[spark] branch branch-3.2 updated: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.2 by this push:
new dc9041a9df0 [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
dc9041a9df0 is described below
commit dc9041a9df0f2f338de1c06c1485a0ab70ff3e16
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue Sep 27 02:14:22 2022 -0700
[SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
This PR aims to add a new legacy configuration to keep `grouping__id` values consistent with the released Apache Spark 3.2 and 3.3 versions.
Please note that this syntax is not standard SQL and even Hive doesn't support it.
```SQL
hive> SELECT version();
OK
3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975
Time taken: 0.111 seconds, Fetched: 1 row(s)
hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b);
FAILED: SemanticException 1:63 [Error 10213]:
Grouping sets expression is not in GROUP BY key. Error encountered near token 'b'
```
SPARK-40218 fixed a bug introduced by SPARK-34932 (in Apache Spark 3.2.0). As a side effect, `grouping__id` values changed.
- Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0:
```scala
scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
+--------+------------+
|count(1)|grouping__id|
+--------+------------+
| 1| 1|
| 1| 1|
+--------+------------+
```
- After SPARK-40218 (Apache Spark 3.4.0, 3.3.1, and 3.2.3):
```scala
scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
+--------+------------+
|count(1)|grouping__id|
+--------+------------+
| 1| 2|
| 1| 2|
+--------+------------+
```
- This PR (Apache Spark 3.4.0, 3.3.1, and 3.2.3):
```scala
scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
+--------+------------+
|count(1)|grouping__id|
+--------+------------+
| 1| 2|
| 1| 2|
+--------+------------+
scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]scala>
sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
+--------+------------+
|count(1)|grouping__id|
+--------+------------+
| 1| 1|
| 1| 1|
+--------+------------+
```
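The flip from `2` back to `1` follows from how `grouping__id` is assembled: one bit per grouping column, set when that column is absent from the current grouping set, with the leftmost grouping column in the most significant position. The legacy flag only changes the column order from (k1, k2) to (k2, k1). A minimal sketch of that computation (illustrative only, not the actual Spark implementation):
```scala
// One bit per grouping column: 1 when the column is NOT in the grouping set.
// The leftmost grouping column occupies the most significant bit.
def groupingId(groupingCols: Seq[String], groupingSet: Set[String]): Long =
  groupingCols.foldLeft(0L) { (id, col) =>
    (id << 1) | (if (groupingSet.contains(col)) 0L else 1L)
  }

groupingId(Seq("k1", "k2"), Set("k2"))  // 2 (binary 10): new column order
groupingId(Seq("k2", "k1"), Set("k2"))  // 1 (binary 01): legacy column order
```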
No, this simply adds back the previous behavior behind the new legacy configuration (disabled by default).
Pass the CIs.
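For reference, the flag can also be set when the session is built rather than via `SET`; a minimal usage sketch assuming a Spark 3.2.3 build that contains this patch (the local master and session setup here are illustrative):
```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; the config could equally be set in
// spark-defaults.conf or on an existing session via spark.conf.set.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.legacy.groupingIdWithAppendedUserGroupBy", "true")
  .getOrCreate()

// With the legacy flag on, grouping__id is 1 again, as in 3.2.0-3.2.2 and 3.3.0.
spark.sql(
  "SELECT count(*), grouping__id FROM (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) " +
  "GROUP BY k1 GROUPING SETS (k2)").show()
```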
Closes #38001 from dongjoon-hyun/SPARK-40562.
Authored-by: Dongjoon Hyun <do...@apache.org>
Signed-off-by: Dongjoon Hyun <do...@apache.org>
(cherry picked from commit 5c0ebf3d97ae49b6e2bd2096c2d590abf4d725bd)
Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
docs/sql-migration-guide.md | 2 ++
.../spark/sql/catalyst/expressions/grouping.scala | 18 ++++++++++++++----
.../scala/org/apache/spark/sql/internal/SQLConf.scala | 12 ++++++++++++
3 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index da132544780..93d45cdc408 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -105,6 +105,8 @@ license: |
- In Spark 3.2, date +/- interval with only day-time fields such as `date '2011-11-11' + interval 12 hours` returns timestamp. In Spark 3.1 and earlier, the same expression returns date. To restore the behavior before Spark 3.2, you can use `cast` to convert timestamp as date.
+ - Since Spark 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, and 3.2.2. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+
## Upgrading from Spark SQL 3.0 to 3.1
- In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
index 22e25b31f2e..8856a3fe94b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
@@ -154,12 +154,22 @@ case class GroupingSets(
// Note that, we must put `userGivenGroupByExprs` at the beginning, to preserve the order of
// grouping columns specified by users. For example, GROUP BY (a, b) GROUPING SETS (b, a), the
// final grouping columns should be (a, b).
- override def children: Seq[Expression] = userGivenGroupByExprs ++ flatGroupingSets
+ override def children: Seq[Expression] =
+ if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+ flatGroupingSets ++ userGivenGroupByExprs
+ } else {
+ userGivenGroupByExprs ++ flatGroupingSets
+ }
+
override protected def withNewChildrenInternal(
newChildren: IndexedSeq[Expression]): GroupingSets =
- copy(
- userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
- flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+ if (SQLConf.get.groupingIdWithAppendedUserGroupByEnabled) {
+ super.legacyWithNewChildren(newChildren).asInstanceOf[GroupingSets]
+ } else {
+ copy(
+ userGivenGroupByExprs = newChildren.take(userGivenGroupByExprs.length),
+ flatGroupingSets = newChildren.drop(userGivenGroupByExprs.length))
+ }
}
object GroupingSets {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 33765619823..b28bfeee245 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3154,6 +3154,15 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+ buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+ .internal()
+ .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+ "user-given group-by expressions order like Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0.")
+ .version("3.2.3")
+ .booleanConf
+ .createWithDefault(false)
+
val PARQUET_INT96_REBASE_MODE_IN_WRITE =
buildConf("spark.sql.parquet.int96RebaseModeInWrite")
.internal()
@@ -4109,6 +4118,9 @@ class SQLConf extends Serializable with Logging {
def integerGroupingIdEnabled: Boolean = getConf(SQLConf.LEGACY_INTEGER_GROUPING_ID)
+ def groupingIdWithAppendedUserGroupByEnabled: Boolean =
+ getConf(SQLConf.LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY)
+
def metadataCacheTTL: Long = getConf(StaticSQLConf.METADATA_CACHE_TTL_SECONDS)
def coalesceBucketsInJoinEnabled: Boolean = getConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED)