Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/01 08:56:50 UTC

[GitHub] [spark] wangyum opened a new pull request, #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

wangyum opened a new pull request, #38464:
URL: https://github.com/apache/spark/pull/38464

   ### What changes were proposed in this pull request?
   
   This PR enhances dynamic partition pruning (DPP) to use bloom filters when `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` is disabled, the build plan is too large to be broadcast, and an existing shuffle exchange can be reused.
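   
   A minimal sketch of the general idea, not code from this PR: a bloom filter summarizes the build-side join keys in a fixed-size bit set, so the pruning side only needs that bit set rather than the full key list. `ToyBloomFilter`, `DppSketch`, and the sizing parameters are hypothetical names and values chosen for illustration; Spark's own bloom filter implementation differs.
   
   ```scala
   import java.util.BitSet
   import scala.util.hashing.MurmurHash3
   
   // Toy bloom filter for illustration only (hypothetical, not Spark's implementation).
   final class ToyBloomFilter(numBits: Int, numHashes: Int) {
     private val bits = new BitSet(numBits)
   
     // Map a key to its i-th bit position by hashing with seed i.
     private def position(key: Long, i: Int): Int =
       Math.floorMod(MurmurHash3.stringHash(key.toString, i), numBits)
   
     def put(key: Long): Unit =
       (0 until numHashes).foreach(i => bits.set(position(key, i)))
   
     // May return true for a key that was never added (false positive),
     // but never returns false for a key that was added.
     def mightContain(key: Long): Boolean =
       (0 until numHashes).forall(i => bits.get(position(key, i)))
   }
   
   object DppSketch {
     // Keep only the partition values that may have a match on the build side.
     def prune(partitionValues: Seq[Long], buildKeys: Seq[Long]): Seq[Long] = {
       val filter = new ToyBloomFilter(numBits = 1 << 16, numHashes = 3)
       buildKeys.foreach(filter.put)
       partitionValues.filter(filter.mightContain)
     }
   }
   ```
   
   Because the filter has no false negatives, every partition value that really has a match is kept. A false positive only means an extra partition is scanned, so the result stays correct.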
   
   ### Why are the changes needed?
   
   Avoid job failures when `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` is disabled. For example, the following query fails with the error shown after it:
   ```sql
   select catalog_sales.* from  catalog_sales join catalog_returns  where cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 40000;
   ```
   ```
   20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #38464:
URL: https://github.com/apache/spark/pull/38464#issuecomment-1298235836

   cc @cloud-fan @sigmod @aokolnychyi @dongjoon-hyun @huaxingao @viirya 


[GitHub] [spark] dongjoon-hyun commented on pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38464:
URL: https://github.com/apache/spark/pull/38464#issuecomment-1298287656

   Thank you for pinging me, @wangyum.


[GitHub] [spark] cloud-fan commented on a diff in pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38464:
URL: https://github.com/apache/spark/pull/38464#discussion_r1023678622


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala:
##########
@@ -65,7 +70,7 @@ case class PlanAdaptiveDynamicPruningFilters(
           DynamicPruningExpression(InSubqueryExec(value, broadcastValues, exprId))
         } else if (onlyInBroadcast) {
           DynamicPruningExpression(Literal.TrueLiteral)
-        } else {
+        } else if (canBroadcastBySize(buildPlan, conf)) {

Review Comment:
   This can be overestimated. The final plan has an `Aggregate`, which may dramatically reduce the data size.
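   
   A small sketch of the concern above, with hypothetical numbers and assuming the aggregate groups by the pruning key: the size of the raw build-side rows can be far larger than the size of the distinct keys that remain after aggregation. `SizeEstimateSketch` and the 8-bytes-per-key width are illustrative assumptions, not Spark's statistics code.
   
   ```scala
   object SizeEstimateSketch {
     // Hypothetical per-key width: one 8-byte long per row.
     private val bytesPerKey: Long = java.lang.Long.BYTES.toLong
   
     // Size if every build-side row were counted as-is.
     def rawBytes(keys: Seq[Long]): Long = keys.size * bytesPerKey
   
     // Size after deduplicating keys, roughly what grouping by the key leaves behind.
     def dedupedBytes(keys: Seq[Long]): Long = keys.distinct.size * bytesPerKey
   }
   ```
   
   For 100,000 rows that share only 100 distinct keys, `rawBytes` gives 800,000 while `dedupedBytes` gives 800, so a check based on the pre-aggregation size could reject a build side that is actually small.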



[GitHub] [spark] cloud-fan commented on pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38464:
URL: https://github.com/apache/spark/pull/38464#issuecomment-1316611596

   I agree with using bloom filters, as the size estimation can be wrong and the build size can be so large that `InSubquery` can't work. However, this PR contains another optimization that forces shuffle reuse when building the subquery that builds the bloom filter. Can we do it later, after more discussion? This is a general optimization that can apply in other places as well: InSubquery DPP, bloom filter join.


[GitHub] [spark] cloud-fan commented on a diff in pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38464:
URL: https://github.com/apache/spark/pull/38464#discussion_r1023684019


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveDynamicPruningFilters.scala:
##########
@@ -77,6 +82,24 @@ case class PlanAdaptiveDynamicPruningFilters(
           val newAdaptivePlan = sparkPlan.asInstanceOf[AdaptiveSparkPlanExec]
           val values = SubqueryExec(name, newAdaptivePlan)
           DynamicPruningExpression(InSubqueryExec(value, values, exprId))
+        } else if (!conf.exchangeReuseEnabled) {
+          DynamicPruningExpression(Literal.TrueLiteral)
+        } else {
+          val childPlan = adaptivePlan.executedPlan
+          val reusedShuffleExchange = collectFirst(rootPlan) {
+            case s: ShuffleExchangeExec if s.child.sameResult(childPlan) => s

Review Comment:
   This is another tricky part: is reusing shuffle always better than starting a new query with column pruning?



[GitHub] [spark] github-actions[bot] closed pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning
URL: https://github.com/apache/spark/pull/38464


[GitHub] [spark] github-actions[bot] commented on pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38464:
URL: https://github.com/apache/spark/pull/38464#issuecomment-1444788358

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

