Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/01 06:51:14 UTC

[GitHub] [spark] sumeetgajjar commented on a change in pull request #35047: [SPARK-37175][SQL] Performance improvement to hash joins with many duplicate keys

sumeetgajjar commented on a change in pull request #35047:
URL: https://github.com/apache/spark/pull/35047#discussion_r796302266



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3380,6 +3380,17 @@ object SQLConf {
       .checkValue(_ >= 0, "The value must be non-negative.")
       .createWithDefault(8)
 
+  val HASHED_RELATION_REORDER_FACTOR = buildConf("spark.sql.hashedRelationReorderFactor")
+    .doc("The HashedRelation will be reordered if the number of unique keys times this factor is " +
+      "less than equal to the total number of rows in the HashedRelation. " +
+      "The reordering places all rows with the same key adjacent to each other to improve " +
+      "spatial locality. This provides a performance boost while iterating over the rows for a " +
+      "given key due to increased cache hits")
+    .version("3.3.0")
+    .doubleConf
+    .checkValue(_ > 1, "The value must be greater than 1.")
+    .createOptional
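
For orientation, here is a minimal sketch of the trigger condition this doc string describes; the name shouldReorder and its signature are hypothetical, not the PR's actual code:

    // Hypothetical sketch: reorder only when keys are duplicated enough that
    // grouping rows with the same key together is likely to pay off.
    def shouldReorder(numUniqueKeys: Long, numRows: Long, reorderFactor: Option[Double]): Boolean =
      reorderFactor.exists(factor => numUniqueKeys * factor <= numRows)

With the suggested factor of 4, a relation with 1,000,000 rows and 200,000 unique keys (average duplication 5) would be reordered, while one with 500,000 unique keys (average duplication 2) would not.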

Review comment:
       Apologies for the delayed response; I was tied up with work and then out sick for a week with COVID.
   
   > @sumeetgajjar - could you help elaborate more why a global default value is sufficient per my question above?
   
    @c21 My rationale for suggesting a global default of 4 came from an experiment I ran locally. I ran a synthetic workload with the hash-reordering optimization force-enabled regardless of the keys' duplication factor, and swept the duplication factor from 1 to 20. The optimization became beneficial once the duplication factor exceeded 4, which is why I suggested that value.
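    
    For concreteness, a rough sketch of that kind of synthetic workload (my reconstruction, not the actual benchmark code): the build side carries d copies of every key, the probe side touches every key, and d is swept from 1 to 20.
    
        // Rough reconstruction of the duplication-factor sweep (hypothetical, not the PR's benchmark).
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.broadcast
    
        val spark = SparkSession.builder().appName("dup-factor-sweep").getOrCreate()
        import spark.implicits._
    
        def runOnce(numKeys: Long, d: Int): Long = {
          // Build side: every key appears d times, so the hashed relation holds duplicate keys.
          val build = spark.range(numKeys * d).select(($"id" % numKeys).as("key"), $"id".as("payload"))
          // Probe side: one row per key, so every key in the hashed relation gets looked up.
          val probe = spark.range(numKeys).select($"id".as("key"))
          // Broadcast the duplicated side so it becomes the HashedRelation being probed.
          probe.join(broadcast(build), "key").count()
        }
    
        // Sweep the duplication factor; locally the reordering only started to pay off
        // once d exceeded roughly 4.
        (1 to 20).foreach(d => runOnce(1000000L, d))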
   
    > If the probe side has many lookups of keys with a lot of values, then we can see the improvement. But if the probe side does not look up these keys much, then we probably cannot see the benefit.
   
    I agree. The synthetic workload I ran queried the probe side such that the majority of the keys had multiple values.
   
    Anyway, due to concerns over the added memory pressure this optimization introduces and the feedback that the config is difficult to tune, I've decided to close the PR. If I find a better approach, I'll reopen it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org