You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "stefankandic (via GitHub)" <gi...@apache.org> on 2023/12/29 14:49:45 UTC

[PR] [WIP][SPARK-46536]Support GROUP BY calendar_interval_type [spark]

stefankandic opened a new pull request, #44538:
URL: https://github.com/apache/spark/pull/44538

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441184699


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala:
##########
@@ -30,7 +30,7 @@ import org.apache.spark.sql.types._
 object TypeUtils extends QueryErrorsBase {
 
   def checkForOrderingExpr(dt: DataType, caller: String): TypeCheckResult = {

Review Comment:
   Let's not touch it to reduce the blast radius.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441762438


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,34 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    val numRows = 50
+    val configurations = Seq(
+      "spark.sql.test.forceApplyHashAggregate",
+      "spark.sql.test.forceApplyObjectHashAggregate",
+      "spark.sql.test.forceApplySortAggregate"
+    )
+
+    for (conf <- configurations) {
+      withSQLConf(conf -> "true") {

Review Comment:
   We should set more configs to trigger sort fallback of (object) hash aggregate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44538: [SPARK-46536][SQL] Support GROUP BY calendar_interval_type
URL: https://github.com/apache/spark/pull/44538


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440312889


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,15 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  @Override
+  public int compareTo(CalendarInterval o) {
+    if (this.months != o.months) {

Review Comment:
   We should add some comments to explain that this is alphabet ordering. It does not have actual meaning but just makes it possible to find identical interval instances.
   
   We should do the same thing for map type so that we can group by map values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441185813


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:
##########
@@ -457,4 +458,38 @@ class ObjectHashAggregateSuite
       )
     }
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    withSQLConf(
+      SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
+      SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
+    ) {
+
+      val numRows = 50
+
+      val dfSame = createAggregate(_ => Tuple1(new CalendarInterval(1, 2, 3)), numRows)
+      assert(getPlanOutput(dfSame).contains("HashAggregate"))

Review Comment:
   there are two hash aggregate operators in Spark: `HashAggregateExec` and `ObjectHashAggregateExec`. Can we test both?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440555565


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashMapGenerator.scala:
##########
@@ -174,6 +174,7 @@ abstract class HashMapGenerator(
           """
         }
       case StringType => hashBytes(s"$input.getBytes()")
+      case CalendarIntervalType => hashInt(s"$input.hashCode()")

Review Comment:
   OK this seems fine, the hash result can be either int or long in the generated code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440601728


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala:
##########
@@ -30,7 +30,7 @@ import org.apache.spark.sql.types._
 object TypeUtils extends QueryErrorsBase {
 
   def checkForOrderingExpr(dt: DataType, caller: String): TypeCheckResult = {

Review Comment:
   good catch, there are 20 usages of the function:
   <img width="722" alt="Screenshot 2024-01-03 at 16 32 54" src="https://github.com/apache/spark/assets/154237371/2096d1b9-00c9-47ed-b12d-27cc13a248f8">
   
   I don't have enough context, but should we maybe only change the implementation in some of these like SortOrder and aggregate ones?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440318155


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashMapGenerator.scala:
##########
@@ -174,6 +174,7 @@ abstract class HashMapGenerator(
           """
         }
       case StringType => hashBytes(s"$input.getBytes()")
+      case CalendarIntervalType => hashInt(s"$input.hashCode()")

Review Comment:
   looking at the code here, I think we need a dedicated hash implementation for calendar interval, to return an int.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440313583


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,15 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  @Override
+  public int compareTo(CalendarInterval o) {
+    if (this.months != o.months) {

Review Comment:
   @stefankandic did you generate this using IDEA?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441635365


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,32 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    // forces the use of sort aggregate by using min/max functions
+
+    val numRows = 50
+    val numRepeat = 25
+
+    val df = (0 to numRows)
+      .map(i => Tuple1(new CalendarInterval(i, i, i)))
+      .toDF("c0")
+
+    for (_ <- 0 until numRepeat) {
+      val shuffledDf = df.orderBy(rand())
+
+      checkAnswer(
+        shuffledDf.agg(max("c0")),

Review Comment:
   agreed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440605402


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,15 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  @Override
+  public int compareTo(CalendarInterval o) {
+    if (this.months != o.months) {

Review Comment:
   Added the comments.
   
   @cloud-fan method was generated by intellij but I implemented the logic



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441185604


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:
##########
@@ -457,4 +458,38 @@ class ObjectHashAggregateSuite
       )
     }
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    withSQLConf(
+      SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
+      SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
+    ) {
+
+      val numRows = 50
+
+      val dfSame = createAggregate(_ => Tuple1(new CalendarInterval(1, 2, 3)), numRows)
+      assert(getPlanOutput(dfSame).contains("HashAggregate"))
+      assert(dfSame.count() == 1)
+
+      val dfDifferent = createAggregate(i => Tuple1(new CalendarInterval(i, i, i)), numRows)
+      assert(getPlanOutput(dfDifferent).contains("HashAggregate"))
+      assert(dfDifferent.count() == numRows)
+    }
+
+    def createAggregate(f: Int => Tuple1[CalendarInterval], numRows: Int): DataFrame = {
+      (1 to numRows)
+        .map(i => f(i))
+        .toDF("c0")
+        .groupBy("c0")
+        .agg(count("*"))
+    }
+
+    def getPlanOutput(df: DataFrame): String = {
+      val outputStream = new java.io.ByteArrayOutputStream()
+      Console.withOut(outputStream) {
+        df.explain(true)

Review Comment:
   oh, I mean offline checking... We don't need to test the result of explain



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1439129096


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,15 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  @Override
+  public int compareTo(CalendarInterval o) {
+    if (this.months != o.months) {

Review Comment:
   Comparing intervals does not necessarily short circuits via months. We could result in `1 month > 0 months 32 days`, which is wrong, obviously.
   
   Besides, 1 month can be 28 ~ 30 days, making the legacy calendar interval type uncomparable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441759661


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala:
##########
@@ -77,9 +77,12 @@ object AggUtils {
       child: SparkPlan): SparkPlan = {
     val useHash = Aggregate.supportsHashAggregate(
       aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes))
+
+    val forceHashAggregate = forceApplyHashAggregate(child.conf)

Review Comment:
   I don't think we need this config. HashAggregateExec is the first choice. We will always pick it if the query is applicable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441635167


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala:
##########
@@ -30,7 +30,7 @@ import org.apache.spark.sql.types._
 object TypeUtils extends QueryErrorsBase {
 
   def checkForOrderingExpr(dt: DataType, caller: String): TypeCheckResult = {

Review Comment:
   turns out we didn't need to change this at all, my bad 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1446833880


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,26 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  /**
+   * This method is not used to order CalendarInterval instances, as they are not orderable and
+   * can't be used in a ORDER BY statement.

Review Comment:
   ```suggestion
      * cannot be used in a ORDER BY statement.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on PR #44538:
URL: https://github.com/apache/spark/pull/44538#issuecomment-1872579314

   We (i.e. @MaxGekk) have added standard year-month and day-time intervals which are much better than calendar interval.
   It is of questionable value to improve this old type.
   @cloud-fan @MaxGekk WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440641170


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:
##########
@@ -457,4 +458,31 @@ class ObjectHashAggregateSuite
       )
     }
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    withSQLConf(
+      SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
+      SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
+    ) {
+      val numRows = 50
+
+      assert(
+        (1 to numRows)
+          .map(_ => Tuple1(new CalendarInterval(1, 2, 3)))
+          .toDF("c0")
+          .groupBy("c0")
+          .agg(count("*"))

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441776159


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,34 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    val numRows = 50
+    val configurations = Seq(
+      "spark.sql.test.forceApplyHashAggregate",
+      "spark.sql.test.forceApplyObjectHashAggregate",
+      "spark.sql.test.forceApplySortAggregate"
+    )
+
+    for (conf <- configurations) {
+      withSQLConf(conf -> "true") {

Review Comment:
   Why would we need that if this way we already test all three of them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440500767


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashMapGenerator.scala:
##########
@@ -174,6 +174,7 @@ abstract class HashMapGenerator(
           """
         }
       case StringType => hashBytes(s"$input.getBytes()")
+      case CalendarIntervalType => hashInt(s"$input.hashCode()")

Review Comment:
   isn't this already calling the `hashcode` method of the calendar interval?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441185223


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,32 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    // forces the use of sort aggregate by using min/max functions
+
+    val numRows = 50
+    val numRepeat = 25
+
+    val df = (0 to numRows)
+      .map(i => Tuple1(new CalendarInterval(i, i, i)))
+      .toDF("c0")
+
+    for (_ <- 0 until numRepeat) {
+      val shuffledDf = df.orderBy(rand())
+
+      checkAnswer(
+        shuffledDf.agg(max("c0")),

Review Comment:
   the test does not even have GROUP BY ...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441184040


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,27 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  /**
+   * This method is not used to order CalendarInterval instances, as they are not orderable and
+   * can't be used in a ORDER BY statement.
+   * Instead, it is used to find identical interval instances for aggregation purposes.
+   * It compares the 'months', 'days', and 'microseconds' fields of this CalendarInterval
+   * with another instance. The comparison is done first on the 'months', then on the 'days',
+   * and finally on the 'microseconds'.
+   *
+   * @param o The CalendarInterval instance to compare with.
+   * @return A negative integer, zero, or a positive integer as this object is less than,

Review Comment:
   let's just say: `returns zero if this object is equal to the specified object, and non-zero otherwise.`
   
   It's a "fake order". We only need it to find identical instances, and the order itself has no meanings. We only need it to be deterministic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441184040


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,27 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  /**
+   * This method is not used to order CalendarInterval instances, as they are not orderable and
+   * can't be used in a ORDER BY statement.
+   * Instead, it is used to find identical interval instances for aggregation purposes.
+   * It compares the 'months', 'days', and 'microseconds' fields of this CalendarInterval
+   * with another instance. The comparison is done first on the 'months', then on the 'days',
+   * and finally on the 'microseconds'.
+   *
+   * @param o The CalendarInterval instance to compare with.
+   * @return A negative integer, zero, or a positive integer as this object is less than,

Review Comment:
   let's just say: `returns zero if this object is equal to the specified object, and non-zero otherwise.`
   
   It's a "fake order". We only need it to find identical instances, and the order itself has no meanings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1438554749


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -44,7 +44,7 @@
  * @since 3.0.0
  */
 @Unstable
-public final class CalendarInterval implements Serializable {
+public final class CalendarInterval implements Serializable, Comparable<CalendarInterval> {

Review Comment:
   Interesting. Please check the behavior
   https://github.com/apache/spark/pull/27262
   I'm not sure. @yaooqinn



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440319623


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SortAggregateSuite.scala:
##########
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper
+import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.tags.SlowHiveTest
+import org.apache.spark.unsafe.types.CalendarInterval
+
+@SlowHiveTest
+class SortAggregateSuite
+  extends QueryTest
+    with SQLTestUtils
+    with TestHiveSingleton
+    with ExpressionEvalHelper
+    with AdaptiveSparkPlanHelper {
+
+  import testImplicits._
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {

Review Comment:
   we can put the new test cases in `DataFrameAggregateSuite`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440312889


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,15 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  @Override
+  public int compareTo(CalendarInterval o) {
+    if (this.months != o.months) {

Review Comment:
   We should add some comments to explain that this is alphabet ordering. It does not have actual meaning but just makes it possible to find identical interval instances.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441850928


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,34 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    val numRows = 50
+    val configurations = Seq(
+      "spark.sql.test.forceApplyHashAggregate",
+      "spark.sql.test.forceApplyObjectHashAggregate",
+      "spark.sql.test.forceApplySortAggregate"
+    )
+
+    for (conf <- configurations) {
+      withSQLConf(conf -> "true") {

Review Comment:
   Added new config which also sets fallback threshold to 1, let me know if that is not enough



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440314519


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##########
@@ -193,8 +193,8 @@ object ExprUtils extends QueryErrorsBase {
           messageParameters = Map("sqlExpr" -> expr.sql))
       }
 
-      // Check if the data type of expr is orderable.
-      if (!RowOrdering.isOrderable(expr.dataType)) {
+      // Check if the data type of expr can be used in group by
+      if (!canBeUsedInGroupBy(expr.dataType)) {

Review Comment:
   nit: `expr.dataType.existsRecursively(_.isInstanceOf[MapType])`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440316276


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala:
##########
@@ -30,7 +30,7 @@ import org.apache.spark.sql.types._
 object TypeUtils extends QueryErrorsBase {
 
   def checkForOrderingExpr(dt: DataType, caller: String): TypeCheckResult = {

Review Comment:
   can we check all the call sites of this function?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441185889


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:
##########
@@ -457,4 +458,38 @@ class ObjectHashAggregateSuite
       )
     }
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {

Review Comment:
   let's put this test in `DataFrameAggregateSuite` as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441762796


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala:
##########
@@ -77,9 +77,12 @@ object AggUtils {
       child: SparkPlan): SparkPlan = {
     val useHash = Aggregate.supportsHashAggregate(
       aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes))
+
+    val forceHashAggregate = forceApplyHashAggregate(child.conf)

Review Comment:
   I did it because I like that this way it is immediately obvious in the test case what the config is saying instead of relying on a default which could technically change at any time.
   
   However if you disagree I'm not against removing it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44538:
URL: https://github.com/apache/spark/pull/44538#issuecomment-1884121328

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1442793883


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,35 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    val numRows = 50
+    val configurations = Seq(
+      Seq.empty[(String, String)], // hash aggregate is used by default

Review Comment:
   We also need to set `spark.sql.TungstenAggregate.testFallbackStartsAt`, see the code in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L81



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441766668


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala:
##########
@@ -77,9 +77,12 @@ object AggUtils {
       child: SparkPlan): SparkPlan = {
     val useHash = Aggregate.supportsHashAggregate(
       aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes))
+
+    val forceHashAggregate = forceApplyHashAggregate(child.conf)

Review Comment:
   I prefer to remove it as it makes the code here confusing. `HashAggregateExec` is not applicable to all queries and the behavior is undefined if we force it. The other two aggregate operators support all queries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.
stefankandic commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441776159


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,34 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    val numRows = 50
+    val configurations = Seq(
+      "spark.sql.test.forceApplyHashAggregate",
+      "spark.sql.test.forceApplyObjectHashAggregate",
+      "spark.sql.test.forceApplySortAggregate"
+    )
+
+    for (conf <- configurations) {
+      withSQLConf(conf -> "true") {

Review Comment:
   Why would we need that if this way we already test all three of them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1446833880


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:
##########
@@ -127,4 +127,26 @@ private void appendUnit(StringBuilder sb, long value, String unit) {
    * @throws ArithmeticException if a numeric overflow occurs
    */
   public Duration extractAsDuration() { return Duration.of(microseconds, ChronoUnit.MICROS); }
+
+  /**
+   * This method is not used to order CalendarInterval instances, as they are not orderable and
+   * can't be used in a ORDER BY statement.

Review Comment:
   ```suggestion
      * cannot be used in a ORDER BY statement.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1440557569


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:
##########
@@ -457,4 +458,31 @@ class ObjectHashAggregateSuite
       )
     }
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    withSQLConf(
+      SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
+      SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
+    ) {
+      val numRows = 50
+
+      assert(
+        (1 to numRows)
+          .map(_ => Tuple1(new CalendarInterval(1, 2, 3)))
+          .toDF("c0")
+          .groupBy("c0")
+          .agg(count("*"))

Review Comment:
   This isn't clear to me that we end up with object hash aggregate. Can you verify it with `.explain(true)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441185359


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -2125,6 +2126,32 @@ class DataFrameAggregateSuite extends QueryTest
       Seq(Row(1))
     )
   }
+
+  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
+    // forces the use of sort aggregate by using min/max functions
+
+    val numRows = 50
+    val numRepeat = 25
+
+    val df = (0 to numRows)
+      .map(i => Tuple1(new CalendarInterval(i, i, i)))
+      .toDF("c0")
+
+    for (_ <- 0 until numRepeat) {
+      val shuffledDf = df.orderBy(rand())
+
+      checkAnswer(
+        shuffledDf.agg(max("c0")),

Review Comment:
   And we shouldn't support this. calendar interval is not orderable and can't support min/max.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46536][SQL] Support GROUP BY calendar_interval_type [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44538:
URL: https://github.com/apache/spark/pull/44538#discussion_r1441184699


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala:
##########
@@ -30,7 +30,7 @@ import org.apache.spark.sql.types._
 object TypeUtils extends QueryErrorsBase {
 
   def checkForOrderingExpr(dt: DataType, caller: String): TypeCheckResult = {

Review Comment:
   Let's not touch it to reduce the blast radius. We should only update certain call sites that need to handle calendar intervals.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org