Posted to reviews@spark.apache.org by "cxzl25 (via GitHub)" <gi...@apache.org> on 2024/02/01 09:48:21 UTC

[PR] [SPARK-46943][SQL] Support for configuring ShuffledHashJoin plan size Threshold [spark]

cxzl25 opened a new pull request, #44982:
URL: https://github.com/apache/spark/pull/44982

   ### What changes were proposed in this pull request?
   Introduce the configuration `spark.sql.shuffledHashJoinThreshold`, which caps the plan size that can be converted to a shuffled hash join (SHJ). The default is the maximum value of Long, so no cap applies unless the configuration is set.
   
   ### Why are the changes needed?
   
   When `spark.sql.join.preferSortMergeJoin` is set to `false`, we may get the following error.
   
   ```java
   org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
   	at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
   	at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:813)
   	at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:761)
   	at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1064)
   	at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:153)
   	at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.buildHashedRelation(ShuffledHashJoinExec.scala:75)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.init(Unknown Source)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6(WholeStageCodegenExec.scala:775)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6$adapted(WholeStageCodegenExec.scala:771)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
   ```
   
   This happens because, when converting a sort merge join (SMJ) to SHJ, the planner only checks whether the size of the plan is smaller than `conf.autoBroadcastJoinThreshold * conf.numShufflePartitions`.
   When the configured `numShufflePartitions` is large enough, this condition is easily met and the join is converted to SHJ. Building the hash relation on the executor then fails due to insufficient memory.
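
   The arithmetic behind this can be sketched as follows. This is a simplified, self-contained model, not the Spark source: plain numbers stand in for the plan statistics and `SQLConf` values, and `canBuildWithCap` is a hypothetical illustration of how a cap such as `spark.sql.shuffledHashJoinThreshold` could be combined with the existing check.

   ```scala
   object ShjSizeCheckSketch {
     // Simplified model of the size check described above.
     // BigInt avoids overflow when the threshold is multiplied by the partition count.
     def canBuildLocalHashMapBySize(
         planSizeInBytes: BigInt,
         autoBroadcastJoinThreshold: Long,
         numShufflePartitions: Int): Boolean =
       planSizeInBytes < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions

     // Hypothetical: apply an absolute upper bound on top of the existing check.
     // With the bound at Long.MaxValue this matches the existing check for any
     // plan size that fits in a Long.
     def canBuildWithCap(
         planSizeInBytes: BigInt,
         autoBroadcastJoinThreshold: Long,
         numShufflePartitions: Int,
         shuffledHashJoinThreshold: Long): Boolean =
       canBuildLocalHashMapBySize(
         planSizeInBytes, autoBroadcastJoinThreshold, numShufflePartitions) &&
         planSizeInBytes <= BigInt(shuffledHashJoinThreshold)

     def main(args: Array[String]): Unit = {
       val tenMiB = 10L * 1024 * 1024                     // illustrative broadcast threshold
       val planSize = BigInt(100L) * 1024 * 1024 * 1024   // illustrative 100 GiB build side
       val capBytes = 64L * 1024 * 1024 * 1024            // illustrative 64 GiB cap

       // 200 partitions: the limit is about 1.95 GiB, so the plan is rejected.
       println(canBuildLocalHashMapBySize(planSize, tenMiB, 200))
       // 20000 partitions: the limit is about 195 GiB, so the same plan is accepted.
       println(canBuildLocalHashMapBySize(planSize, tenMiB, 20000))
       // With the illustrative 64 GiB cap, the plan is rejected again.
       println(canBuildWithCap(planSize, tenMiB, 20000, capBytes))
     }
   }
   ```

   In this model the accepted plan size grows linearly with the partition count, which is why a large `numShufflePartitions` makes the conversion easy to trigger.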
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Verified in a production environment.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46943][SQL] Support for configuring ShuffledHashJoin plan size Threshold [spark]

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.

cxzl25 commented on code in PR #44982:
URL: https://github.com/apache/spark/pull/44982#discussion_r1474168625


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinSelectionHelperSuite.scala:
##########
@@ -165,15 +165,32 @@ class JoinSelectionHelperSuite extends PlanTest with JoinSelectionHelper {
   }
 
   test("getShuffleHashJoinBuildSide (hintOnly = false) return BuildRight when right is smaller") {
-    val broadcastSide = getBroadcastBuildSide(

Review Comment:
   The current implementation of this unit test is wrong.




Re: [PR] [SPARK-46943][SQL] Support for configuring ShuffledHashJoin plan size Threshold [spark]

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on PR #44982:
URL: https://github.com/apache/spark/pull/44982#issuecomment-1925545158

   Does `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold` satisfy your requirements? It is more accurate to convert SMJ to SHJ in AQE by checking whether the actual partition size is less than the threshold.
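
   A minimal sketch of setting this option, assuming a Spark version that includes it and a local session. The application name, master URL, and the `64MB` value are illustrative assumptions, not recommendations.

   ```scala
   import org.apache.spark.sql.SparkSession

   object AqeShjConfigSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("aqe-shj-threshold-sketch") // hypothetical application name
         .master("local[*]")                  // assumption: local run for illustration
         // AQE must be enabled for the adaptive threshold to be considered.
         .config("spark.sql.adaptive.enabled", "true")
         // Illustrative per-partition limit for building a local hash map.
         .config("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")
         .getOrCreate()

       // Read the value back to confirm it was applied to the session.
       println(spark.conf.get("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold"))
       spark.stop()
     }
   }
   ```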



Re: [PR] [SPARK-46943][SQL] Support for configuring ShuffledHashJoin plan size Threshold [spark]

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.

cxzl25 closed pull request #44982: [SPARK-46943][SQL] Support for configuring ShuffledHashJoin plan size Threshold
URL: https://github.com/apache/spark/pull/44982



Re: [PR] [SPARK-46943][SQL] Support for configuring ShuffledHashJoin plan size Threshold [spark]

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.

cxzl25 commented on PR #44982:
URL: https://github.com/apache/spark/pull/44982#issuecomment-1923205545

   @cloud-fan @ulysses-you Please help review when you have time. Thank you!

