You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wankunde (via GitHub)" <gi...@apache.org> on 2023/05/30 03:45:07 UTC

[GitHub] [spark] wankunde opened a new pull request, #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

wankunde opened a new pull request, #41374:
URL: https://github.com/apache/spark/pull/41374

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   
   Spark will enable fast hash map for primitive data types in HashAggregateExec.
   Enable this for distinct queries which bufferSchema is empty.
   For example, we can also build a fast hash map with the key `a + b` for query `SELECT distinct a, b from tab`
   
   ### Why are the changes needed?
   
   Improve distinct queries performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Exists UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1236012516


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Could you address the above @viirya 's comment about test coverage, @wankunde ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries
URL: https://github.com/apache/spark/pull/41374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235957028


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   This condition was added in #12710. It was not explained clearly, but I suppose that it is added because for performance improvement on it. I read through related codes, seems this condition doesn't actually cause issue (I might be missing it out actually). It is better if you can confirm there are test coverage or you can add some.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "wankunde (via GitHub)" <gi...@apache.org>.
wankunde commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1236248814


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Hi, @viirya @dongjoon-hyun , thanks for your review. 
   Add a UT for this change: https://github.com/apache/spark/pull/41685



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235404204


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Just a question, do you use this patch in your production environment, @wankunde ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "wankunde (via GitHub)" <gi...@apache.org>.
wankunde commented on PR #41374:
URL: https://github.com/apache/spark/pull/41374#issuecomment-1567718047

   cc @sameeragarwal @c21 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "wankunde (via GitHub)" <gi...@apache.org>.
wankunde commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235507161


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   No, I'm sorry, I don't quite understand `bufferSchema.nonEmpty` is for what?
   
   For example, for query `SELECT distinct a, b from tab`  ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235551102


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Got it. 
   > No
   
   `bufferSchema.nonEmpty` was a legacy part of old commit in order to narrow down the scope in other cases. I also think it's not used in that case.
   >  I don't quite understand bufferSchema.nonEmpty is for what? For example, for query SELECT distinct a, b from tab ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235924087


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Hmm, I'm wondering if we have test coverage for this?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Hmm, I'm wondering if we have proper test coverage for this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41374:
URL: https://github.com/apache/spark/pull/41374#issuecomment-1598995070

   cc @viirya


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41374:
URL: https://github.com/apache/spark/pull/41374#issuecomment-1599198896

   Merged to master for Apache Spark 3.5.0. Thank you, @wankunde .
   - https://spark.apache.org/versioning-policy.html (Apache Spark 3.5.0 on August 2023)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org