You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "wankunde (via GitHub)" <gi...@apache.org> on 2023/05/30 03:45:07 UTC

[GitHub] [spark] wankunde opened a new pull request, #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

wankunde opened a new pull request, #41374:
URL: https://github.com/apache/spark/pull/41374

### What changes were proposed in this pull request?

Spark will enable fast hash map for primitive data types in HashAggregateExec.
Enable this for distinct queries which bufferSchema is empty.
For example, we can also build a fast hash map with the key `a + b` for query `SELECT distinct a, b from tab`

### Why are the changes needed?

Improve distinct queries performance.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Exists UT

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1236012516


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Could you address the above @viirya 's comment about test coverage, @wankunde ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun closed pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun closed pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries
URL: https://github.com/apache/spark/pull/41374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235957028


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   This condition was added in #12710. It was not explained clearly, but I suppose that it is added because for performance improvement on it. I read through related codes, seems this condition doesn't actually cause issue (I might be missing it out actually). It is better if you can confirm there are test coverage or you can add some.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wankunde commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "wankunde (via GitHub)" <gi...@apache.org>.

wankunde commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1236248814


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Hi, @viirya @dongjoon-hyun , thanks for your review. 
   Add a UT for this change: https://github.com/apache/spark/pull/41685



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235404204


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Just a question, do you use this patch in your production environment, @wankunde ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wankunde commented on pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "wankunde (via GitHub)" <gi...@apache.org>.

wankunde commented on PR #41374:
URL: https://github.com/apache/spark/pull/41374#issuecomment-1567718047

   cc @sameeragarwal @c21 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wankunde commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "wankunde (via GitHub)" <gi...@apache.org>.

wankunde commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235507161


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   No, I'm sorry, I don't quite understand `bufferSchema.nonEmpty` is for what?
   
   For example, for query `SELECT distinct a, b from tab`  ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235551102


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Got it. 
   > No
   
   `bufferSchema.nonEmpty` was a legacy part of old commit in order to narrow down the scope in other cases. I also think it's not used in that case.
   >  I don't quite understand bufferSchema.nonEmpty is for what? For example, for query SELECT distinct a, b from tab ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #41374:
URL: https://github.com/apache/spark/pull/41374#discussion_r1235924087


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Hmm, I'm wondering if we have test coverage for this?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala:
##########
@@ -383,8 +383,7 @@ case class HashAggregateExec(
     val isSupported =
       (groupingKeySchema ++ bufferSchema).forall(f => CodeGenerator.isPrimitiveType(f.dataType) ||
         f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType] ||
-        f.dataType.isInstanceOf[CalendarIntervalType]) &&
-        bufferSchema.nonEmpty

Review Comment:
   Hmm, I'm wondering if we have proper test coverage for this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #41374:
URL: https://github.com/apache/spark/pull/41374#issuecomment-1598995070

   cc @viirya


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #41374: [SPARK-43876][SQL] Enable fast hashmap for distinct queries

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #41374:
URL: https://github.com/apache/spark/pull/41374#issuecomment-1599198896

   Merged to master for Apache Spark 3.5.0. Thank you, @wankunde .
   - https://spark.apache.org/versioning-policy.html (Apache Spark 3.5.0 on August 2023)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org