Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/26 09:28:16 UTC

[GitHub] [spark] dongjoon-hyun opened a new pull request, #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun opened a new pull request, #38001:
URL: https://github.com/apache/spark/pull/38001

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258433555

   Before SPARK-34932, Apache Spark failed at `GROUP BY a GROUPING SETS(a, b)` and forced users to also put `b` after `GROUP BY`. SPARK-34932 allows it by treating it like `GROUP BY GROUPING SETS(a, b)`, @viirya .
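To illustrate why this matters for `grouping_id()`, here is a hypothetical sketch (not Spark's actual implementation) that models the id as a bitmask over the grouping columns; it only shows that the same grouping set produces different ids under different column orders, which is the behavior difference this config toggles:

```python
# Hypothetical sketch (NOT Spark's actual implementation): model
# grouping_id() as a bitmask over the grouping columns, where bit i is 1
# when the i-th column is aggregated away (not in the current grouping set).
def grouping_id(all_columns, active_set):
    gid = 0
    for col in all_columns:
        gid = (gid << 1) | (0 if col in active_set else 1)
    return gid

# With column order [a, b] -- i.e. treated like GROUP BY GROUPING SETS (a, b):
print(grouping_id(["a", "b"], {"a"}))  # 1 (b is aggregated away)
print(grouping_id(["a", "b"], {"b"}))  # 2 (a is aggregated away)

# Any other column order yields a different id for the same grouping set,
# which is why appending the user-given GROUP BY columns changed the values:
print(grouping_id(["b", "a"], {"a"}))  # 2 instead of 1
```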




[GitHub] [spark] cloud-fan commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

cloud-fan commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258006340

   To double check, what's the result of this query in other RDBMS? We should only add a legacy config if the current behavior is the correct one.




[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258524084

   Thank you for your feedback, @thiyaga .




[GitHub] [spark] viirya commented on a diff in pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

viirya commented on code in PR #38001:
URL: https://github.com/apache/spark/pull/38001#discussion_r980647185


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3574,6 +3574,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+   val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+        "user-given group-by expressions order.")

Review Comment:
   Do we need to mention when the behavior of grouping_id has changed?





[GitHub] [spark] thiyaga commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

thiyaga commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258517989

   We use grouping sets in our queries and rely on `grouping__id` as an identifier to query the data for the respective group. If we use `grouping__id` directly, it is prone to change whenever the grouping sets change (e.g. adding a new grouping set, or adding a new column to an existing grouping set). Any grouping id change becomes even more complex when this data is consumed directly from reporting tools like Tableau. We need one of the following options to mitigate the changing `grouping__id`:
   
   1. Transform the `grouping__id` into something deterministic that won't be impacted when the grouping sets change (e.g. convert `grouping__id` to a `group_name`).
   2. Have some sort of logical DB view which handles the transformation at runtime (e.g. using CASE WHEN).
   
   In essence, we always have a dependency on `grouping__id` whenever grouping sets are used in our queries. Any change in the grouping id generation has an immediate impact. This new parameter will help us keep using the legacy logic.
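The two mitigation options above can be sketched as follows; the column names, grouping sets, and the id computation here are illustrative only, not taken from the PR:

```python
# Hypothetical sketch of both mitigations; the columns, grouping sets, and
# the bitmask id computation are illustrative, not Spark's actual logic.
ALL_COLUMNS = ["region", "product"]
GROUPING_SETS = [("region",), ("region", "product")]

def grouping_id(active_set):
    # Bit i is 1 when the i-th grouping column is aggregated away.
    gid = 0
    for col in ALL_COLUMNS:
        gid = (gid << 1) | (0 if col in active_set else 1)
    return gid

# Option 1: derive a stable id -> group_name mapping once per query shape;
# when the grouping sets change, only this mapping is rebuilt, so downstream
# consumers (e.g. Tableau) never depend on the raw grouping__id values.
GROUP_NAME = {grouping_id(set(s)): "_".join(s) for s in GROUPING_SETS}

# Option 2: generate the CASE WHEN expression for a logical view from it.
def case_when_sql(mapping, id_col="grouping__id"):
    whens = " ".join(f"WHEN {id_col} = {gid} THEN '{name}'"
                     for gid, name in mapping.items())
    return f"CASE {whens} ELSE 'unknown' END AS group_name"

print(case_when_sql(GROUP_NAME))
```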




[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1259217641

   Thank you again, @cloud-fan , @viirya , @thiyaga,  @huaxingao , @zhengruifeng . 
   Since the last commit only touches the docs, I'll merge this.
   
   Merged to master/3.3/3.2.
   
   cc @wangyum , too




[GitHub] [spark] dongjoon-hyun closed pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun closed pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
URL: https://github.com/apache/spark/pull/38001




[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1257769296

   cc @cloud-fan , @wangyum , @viirya , @huaxingao , @sunchao 




[GitHub] [spark] viirya commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

viirya commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1259140729

   Thank you @dongjoon-hyun.




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun commented on code in PR #38001:
URL: https://github.com/apache/spark/pull/38001#discussion_r980867778


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3574,6 +3574,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+   val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+        "user-given group-by expressions order.")

Review Comment:
   I added the old versions, `3.2.0, 3.2.1, 3.2.2, and 3.3.0`.





[GitHub] [spark] viirya commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

viirya commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258342234

   Are the grouping id values generated in Spark 3.2.0 the same as in previous versions (e.g. Spark 3.1.x and earlier)?
   
   




[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258332867

   @cloud-fan As you wrote in the PR description (https://github.com/apache/spark/pull/32022), it's not in the SQL standard, is it?
   > GROUP BY ... GROUPING SETS (...) is a weird SQL syntax we copied from Hive. It's not in the SQL standard or any other mainstream databases. 




[GitHub] [spark] cloud-fan commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

cloud-fan commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1258815644

   > it's not in the SQL standard
   
   Yea, but since we copied it from Hive, I think the result should match Hive as well. Sorry, I didn't realize there was a result change when doing the bug-fix PR. Can we try with Hive to make sure we return the correct result today?




[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

dongjoon-hyun commented on PR #38001:
URL: https://github.com/apache/spark/pull/38001#issuecomment-1259077129

   Thank you, @cloud-fan , @viirya , @huaxingao . Yes, as Wenchen shared, this is really Spark-specific syntax now. Let me add that to the PR description.
   ```
   hive> SELECT version();
   OK
   3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975
   Time taken: 0.111 seconds, Fetched: 1 row(s)
   
   hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b);
   FAILED: SemanticException 1:63 [Error 10213]: Grouping sets expression is not in GROUP BY key. Error encountered near token 'b'
   ```
   
   I'll revise the config according to @viirya 's comment and will add a migration guide for this.
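For reference, a user who needs the pre-3.2.0 `grouping_id()` values back would enable the flag added by this PR (default `false`, per the diff above), e.g. in `spark-defaults.conf` or via an equivalent `--conf` / `SET` statement:

```
spark.sql.legacy.groupingIdWithAppendedUserGroupBy true
```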

