You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/17 03:02:42 UTC
[GitHub] [spark] wangyum opened a new pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
wangyum opened a new pull request #29726:
URL: https://github.com/apache/spark/pull/29726
### What changes were proposed in this pull request?
For some filtering side can not broadcast by join type but can broadcast by size, then we should not consider reuse broadcast only, for example:
Left outer join and left side very small.
### Why are the changes needed?
Improve query performance for these cases:
![image](https://user-images.githubusercontent.com/5399861/92882197-445a2a00-f442-11ea-955d-16a7724e535b.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781030094
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39795/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803644398
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136313/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807964047
Thank you, @wangyum and all. I left a few minor comments.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807897409
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41124/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780332095
**[Test build #135187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135187/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-806940688
**[Test build #136527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136527/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798199570
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136028/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum edited a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780266596
```scala
sql("set spark.sql.adaptive.enabled=false")
sql("CREATE TABLE t1 using parquet partitioned by (b) AS SELECT id AS a, id % 1000 AS b FROM range(200000000L) distribute by b")
sql("CREATE TABLE t2 using parquet AS SELECT id as c,id AS d FROM range(2000)")
sql("SELECT count(*) FROM t2 left join t1 on t1.b=t2.c where t2.d < 10").show
```
Before this pr | After this pr
-- | --
![image](https://user-images.githubusercontent.com/5399861/108152681-8d5e8a80-7114-11eb-9e9d-ca6bb5bb3e29.png) | ![image](https://user-images.githubusercontent.com/5399861/108150840-59816600-7110-11eb-8b94-c2ad740d125d.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780319875
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39768/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807907173
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41124/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780383012
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39774/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781017325
**[Test build #135214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135214/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780405252
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39774/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594
cc @cloud-fan
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780296498
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39768/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543
**[Test build #128558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128558/testReport)** for PR 29726 at commit [`2986aad`](https://github.com/apache/spark/commit/2986aad31a98b83397ce216346112e041663d9f4).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803610725
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40895/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807964686
**[Test build #136540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807907173
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41124/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807023785
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41111/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602034890
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -307,6 +307,17 @@ object SQLConf {
.booleanConf
.createWithDefault(true)
+ val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =
+ buildConf("spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio")
+ .internal()
+ .doc("When filtering side doesn't support broadcast by join type, and doing DPP means " +
+ "running an extra query that may have significant overhead. This config will be used " +
+ "as the extra filter ratio for computing the data size of the pruning side after DPP, " +
+ "in order to evaluate if it is worth adding an extra subquery as the pruning filter.")
+ .version("3.2.0")
+ .doubleConf
Review comment:
Can we have `.checkValue` for the expected range, `0.0 <=` and `<= 1.0`? Or, are we allowing more possibility like `2.0`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807319957
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136527/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780405252
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39774/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781054181
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39795/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780270050
**[Test build #135187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135187/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807969383
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136540/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] maryannxue commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r589591161
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -85,22 +86,36 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
filteringKey: Expression,
filteringPlan: LogicalPlan,
joinKeys: Seq[Expression],
- partScan: LogicalRelation): LogicalPlan = {
- val reuseEnabled = SQLConf.get.exchangeReuseEnabled
+ partScan: LogicalRelation,
+ canBuildBroadcast: Boolean): LogicalPlan = {
val index = joinKeys.indexOf(filteringKey)
- lazy val hasBenefit = pruningHasBenefit(pruningKey, partScan, filteringKey, filteringPlan)
- if (reuseEnabled || hasBenefit) {
- // insert a DynamicPruning wrapper to identify the subquery during query planning
+ lazy val hasBenefit = pruningHasBenefit(
Review comment:
I don't quite get it here. `pruningHasBenefit` was meant to return "whether it's worth pruning even if there's no broadcast reuse, in other words, even if we have to run an extra subquery on its own", so why do we need to pass over `canBuildBroadcast` to this method?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570828
**[Test build #128594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128594/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803610238
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40895/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691085751
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780348226
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135187/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803599521
**[Test build #136313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136313/testReport)** for PR 29726 at commit [`1c0f2a0`](https://github.com/apache/spark/commit/1c0f2a046ab98746f45ce85b98b53f30b31481e9).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807036912
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41111/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807969383
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136540/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807037016
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41111/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798195887
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40612/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780266596
Before this pr | After this pr
-- | --
![image](https://user-images.githubusercontent.com/5399861/108150821-4f5f6780-7110-11eb-9c9b-a8708ee2bbe9.png) | ![image](https://user-images.githubusercontent.com/5399861/108150840-59816600-7110-11eb-8b94-c2ad740d125d.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807901054
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41124/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780319875
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39768/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r598289449
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +151,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We need to make sure that otherPlan can be broadcast by size to avoid driver OOM
+ canBroadcastBySize(otherPlan, conf) && estimatePruningSideSize * 0.04 > overhead
Review comment:
(pruning side stats size) * 0.04 * 0.5(spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio) > 10M(spark.sql.autoBroadcastJoinThreshold)
The (pruning side stats size) around 500M.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r577353386
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -196,6 +198,25 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => false
}
+ /**
+ * For some filtering side can not broadcast by join type but can broadcast by size,
+ * then we should not consider reuse broadcast only, for example:
+ * Left outer join and left side very small.
+ */
+ private def isReuseBroadcastOnly(
+ canBuildBroadcast: Boolean,
+ hasBenefit: Boolean,
+ hintToBroadcastPruningSide: Boolean,
+ pruningPlan: LogicalPlan,
+ filteringPlan: LogicalPlan): Boolean = {
+ if (canBuildBroadcast) {
+ !hasBenefit || SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly
+ } else {
+ !(hasBenefit && canBroadcastBySize(filteringPlan, SQLConf.get) &&
+ !hintToBroadcastPruningSide && !canBroadcastBySize(pruningPlan, SQLConf.get))
Review comment:
Yes. To avoid driver OOM.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602035906
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -146,10 +150,18 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We can't reuse the broadcast because the join type doesn't support broadcast,
+ // and doing DPP means running an extra query that may have significant overhead.
+ // We need to make sure the pruning side is very big so that DPP is still worthy.
+ canBroadcastBySize(otherPlan, conf) &&
+ estimatePruningSideSize * conf.dynamicPartitionPruningPruningSideExtraFilterRatio > overhead
Review comment:
Could you update this function description according to this new configuration, please? For now, we are mentioning only `spark.sql.optimizer.joinFilterRatio`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798199570
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803610725
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40895/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798208482
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40612/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780348226
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135187/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602032483
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -307,6 +307,17 @@ object SQLConf {
.booleanConf
.createWithDefault(true)
+ val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =
+ buildConf("spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio")
+ .internal()
Review comment:
Indentation: two more spaces?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570828
**[Test build #128594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128594/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803599521
**[Test build #136313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136313/testReport)** for PR 29726 at commit [`1c0f2a0`](https://github.com/apache/spark/commit/1c0f2a046ab98746f45ce85b98b53f30b31481e9).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r600540752
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -146,10 +150,17 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We can't reuse the broadcast because the join type doesn't support broadcast,
+ // and doing DPP means running an extra query that may have significant overhead.
+ // We need to make the pruning side is very big so that DPP is still worthy.
Review comment:
`make` -> `make sure`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807829270
retest this please
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-750674816
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781078108
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135214/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781078108
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135214/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807319957
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136527/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594
cc @cloud-fan
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807932345
thanks, merging to master!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807293466
**[Test build #136527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136527/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084456
**[Test build #128558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128558/testReport)** for PR 29726 at commit [`2986aad`](https://github.com/apache/spark/commit/2986aad31a98b83397ce216346112e041663d9f4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r600541047
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +151,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We need to make sure that otherPlan can be broadcast by size to avoid driver OOM
+ canBroadcastBySize(otherPlan, conf) && estimatePruningSideSize * 0.04 > overhead
Review comment:
can we make `0.04` configurable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-806940688
**[Test build #136527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136527/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807037016
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41111/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691606128
**[Test build #128594 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128594/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781074426
**[Test build #135214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135214/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780350692
**[Test build #135193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135193/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570962
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807861127
**[Test build #136540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780506569
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135193/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691606352
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum edited a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780266596
```scala
sql("set spark.sql.adaptive.enabled=false")
sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
sql("CREATE TABLE t1 using parquet partitioned by (b) AS SELECT id AS a, id % 1000 AS b FROM range(200000000L) distribute by b")
sql("CREATE TABLE t2 using parquet AS SELECT id as c,id AS d FROM range(2000)")
sql("SELECT count(*) FROM t2 left join t1 on t1.b=t2.c where t2.d < 10").show
```
Before this pr | After this pr
-- | --
![image](https://user-images.githubusercontent.com/5399861/108152681-8d5e8a80-7114-11eb-9e9d-ca6bb5bb3e29.png) | ![image](https://user-images.githubusercontent.com/5399861/108150840-59816600-7110-11eb-8b94-c2ad740d125d.png)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780374411
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39774/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803644066
**[Test build #136313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136313/testReport)** for PR 29726 at commit [`1c0f2a0`](https://github.com/apache/spark/commit/1c0f2a046ab98746f45ce85b98b53f30b31481e9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r591660490
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -85,22 +86,36 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
filteringKey: Expression,
filteringPlan: LogicalPlan,
joinKeys: Seq[Expression],
- partScan: LogicalRelation): LogicalPlan = {
- val reuseEnabled = SQLConf.get.exchangeReuseEnabled
+ partScan: LogicalRelation,
+ canBuildBroadcast: Boolean): LogicalPlan = {
val index = joinKeys.indexOf(filteringKey)
- lazy val hasBenefit = pruningHasBenefit(pruningKey, partScan, filteringKey, filteringPlan)
- if (reuseEnabled || hasBenefit) {
- // insert a DynamicPruning wrapper to identify the subquery during query planning
+ lazy val hasBenefit = pruningHasBenefit(
Review comment:
This is used to set `onlyInBroadcast` to false. Because `SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly ||! hasBenefit` is always true by default.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r487866106
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -196,6 +198,25 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => false
}
+ /**
+ * For some filtering side can not broadcast by join type but can broadcast by size,
+ * then we should not consider reuse broadcast only, for example:
+ * Left outer join and left side very small.
+ */
+ private def isReuseBroadcastOnly(
+ canBuildBroadcast: Boolean,
+ hasBenefit: Boolean,
+ hintToBroadcastPruningSide: Boolean,
+ pruningPlan: LogicalPlan,
+ filteringPlan: LogicalPlan): Boolean = {
+ if (canBuildBroadcast) {
+ !hasBenefit || SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly
+ } else {
+ !(hasBenefit && canBroadcastBySize(filteringPlan, SQLConf.get) &&
+ !hintToBroadcastPruningSide && !canBroadcastBySize(pruningPlan, SQLConf.get))
Review comment:
`!canBroadcastBySize(pruningPlan, SQLConf.get)` do we need this condition?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780270050
**[Test build #135187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135187/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r595823873
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +151,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We need to make sure that otherPlan can be broadcast by size to avoid driver OOM
+ canBroadcastBySize(otherPlan, conf) && estimatePruningSideSize * 0.04 > overhead
Review comment:
We should explain the rationale clearly. The problem here is we can't reuse the broadcast (because the join type doesn't support broadcast), and doing DPP means running an extra query that may have significant overhead. We need to make the pruning side is very big so that DPP is still worthy.
BTW how do you come up with 0.04?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803644398
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136313/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807861127
**[Test build #136540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691085751
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570962
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780302419
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39768/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543
**[Test build #128558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128558/testReport)** for PR 29726 at commit [`2986aad`](https://github.com/apache/spark/commit/2986aad31a98b83397ce216346112e041663d9f4).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781054181
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39795/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803609317
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40895/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781017325
**[Test build #135214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135214/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602032154
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -307,6 +307,17 @@ object SQLConf {
.booleanConf
.createWithDefault(true)
+ val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =
Review comment:
`PARTITON` -> `PARTITION`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780483240
**[Test build #135193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135193/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798208542
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40612/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781038279
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39795/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780506569
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135193/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780350692
**[Test build #135193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135193/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] maryannxue commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
maryannxue commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-792915413
If I understand correctly (please make it clearer in your PR description), you are trying to enable DPP even if the filter is not broadcast-able but is small enough, I do not think it is a proper solution to hack through this "broadcastOnly" flag. Instead, you should improve the cost model in `pruningHasBenefit` in this PR and change the config `broadcastOnly` to `false` in your own application. And if you can further prove that with the improved cost model, you get performance improvement across the board, you can suggest flipping the default conf value in this PR or another follow-up one.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r591671088
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +163,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We need to ensure that the otherPlan can collect to the driver.
+ canBroadcastBySize(otherPlan, SQLConf.get) && estimatePruningSideSize * 0.04 > overhead
Review comment:
This is to ensure that the pruning side is much lager, after all, we cannot reuse broadcast. For example.
If overhead = 10M
| canBuildBroadcast | canNotBuildBroadcast
-- | -- | --
hasBenefit=true | estimatePruningSideSize > 20M | estimatePruningSideSize > 500M
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691606352
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-792232345
cc @maryannxue
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] maryannxue commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side
Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r589586920
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +163,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
case _ => fallbackRatio
}
+ val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
// the pruning overhead is the total size in bytes of all scan relations
val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
- filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+ if (canBuildBroadcast) {
+ estimatePruningSideSize > overhead
+ } else {
+ // We need to ensure that the otherPlan can collect to the driver.
+ canBroadcastBySize(otherPlan, SQLConf.get) && estimatePruningSideSize * 0.04 > overhead
Review comment:
what is this `0.04`?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org