You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/17 03:02:42 UTC

[GitHub] [spark] wangyum opened a new pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

wangyum opened a new pull request #29726:
URL: https://github.com/apache/spark/pull/29726


   ### What changes were proposed in this pull request?
   
   For some filtering side can not broadcast by join type but can broadcast by size, then we should not consider reuse broadcast only, for example:
   
   Left outer join and left side very small.
   
   ### Why are the changes needed?
   
   Improve query performance for these cases:
   ![image](https://user-images.githubusercontent.com/5399861/92882197-445a2a00-f442-11ea-955d-16a7724e535b.png)
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781030094


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39795/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803644398


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136313/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807964047


   Thank you, @wangyum and all. I left a few minor comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807897409


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41124/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780332095


   **[Test build #135187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135187/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
    * This patch passes all tests.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-806940688


   **[Test build #136527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136527/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798199570


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136028/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum edited a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780266596


   ```scala
   sql("set spark.sql.adaptive.enabled=false")
   sql("CREATE TABLE t1 using parquet partitioned by (b) AS SELECT id AS a, id % 1000 AS b FROM range(200000000L) distribute by b")
   sql("CREATE TABLE t2 using parquet AS SELECT id as c,id AS d FROM range(2000)")
   sql("SELECT count(*) FROM t2 left join t1 on t1.b=t2.c where t2.d < 10").show
   ```
   Before this pr | After this pr
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/108152681-8d5e8a80-7114-11eb-9e9d-ca6bb5bb3e29.png)  | ![image](https://user-images.githubusercontent.com/5399861/108150840-59816600-7110-11eb-8b94-c2ad740d125d.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780319875


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39768/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807907173


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41124/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780383012


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39774/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781017325


   **[Test build #135214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135214/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780405252


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39774/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594


   cc @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780296498


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39768/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543


   **[Test build #128558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128558/testReport)** for PR 29726 at commit [`2986aad`](https://github.com/apache/spark/commit/2986aad31a98b83397ce216346112e041663d9f4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803610725


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40895/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807964686


   **[Test build #136540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807907173


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41124/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807023785


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41111/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602034890



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -307,6 +307,17 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =
+    buildConf("spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio")
+    .internal()
+    .doc("When filtering side doesn't support broadcast by join type, and doing DPP means " +
+      "running an extra query that may have significant overhead. This config will be used " +
+      "as the extra filter ratio for computing the data size of the pruning side after DPP, " +
+      "in order to evaluate if it is worth adding an extra subquery as the pruning filter.")
+    .version("3.2.0")
+    .doubleConf

Review comment:
       Can we have `.checkValue` for the expected range, `0.0 <=` and `<= 1.0`? Or, are we allowing more possibility like `2.0`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807319957


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136527/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780405252


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39774/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781054181


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39795/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780270050


   **[Test build #135187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135187/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807969383


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136540/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r589591161



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -85,22 +86,36 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       filteringKey: Expression,
       filteringPlan: LogicalPlan,
       joinKeys: Seq[Expression],
-      partScan: LogicalRelation): LogicalPlan = {
-    val reuseEnabled = SQLConf.get.exchangeReuseEnabled
+      partScan: LogicalRelation,
+      canBuildBroadcast: Boolean): LogicalPlan = {
     val index = joinKeys.indexOf(filteringKey)
-    lazy val hasBenefit = pruningHasBenefit(pruningKey, partScan, filteringKey, filteringPlan)
-    if (reuseEnabled || hasBenefit) {
-      // insert a DynamicPruning wrapper to identify the subquery during query planning
+    lazy val hasBenefit = pruningHasBenefit(

Review comment:
       I don't quite get it here. `pruningHasBenefit` was meant to return "whether it's worth pruning even if there's no broadcast reuse, in other words, even if we have to run an extra subquery on its own", so why do we need to pass over `canBuildBroadcast` to this method?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570828


   **[Test build #128594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128594/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803610238


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40895/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691085751






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780348226


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135187/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803599521


   **[Test build #136313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136313/testReport)** for PR 29726 at commit [`1c0f2a0`](https://github.com/apache/spark/commit/1c0f2a046ab98746f45ce85b98b53f30b31481e9).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807036912


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41111/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807969383


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136540/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807037016


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41111/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798195887


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40612/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780266596


   
   Before this pr | After this pr
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/108150821-4f5f6780-7110-11eb-9c9b-a8708ee2bbe9.png) | ![image](https://user-images.githubusercontent.com/5399861/108150840-59816600-7110-11eb-8b94-c2ad740d125d.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807901054


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41124/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780319875


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39768/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r598289449



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +151,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We need to make sure that otherPlan can be broadcast by size to avoid driver OOM
+      canBroadcastBySize(otherPlan, conf) && estimatePruningSideSize * 0.04 > overhead

Review comment:
       (pruning side stats size) * 0.04 * 0.5(spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio) > 10M(spark.sql.autoBroadcastJoinThreshold)
   
   The (pruning side stats size) around 500M.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r577353386



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -196,6 +198,25 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
     case _ => false
   }
 
+  /**
+   * For some filtering side can not broadcast by join type but can broadcast by size,
+   * then we should not consider reuse broadcast only, for example:
+   * Left outer join and left side very small.
+   */
+  private def isReuseBroadcastOnly(
+      canBuildBroadcast: Boolean,
+      hasBenefit: Boolean,
+      hintToBroadcastPruningSide: Boolean,
+      pruningPlan: LogicalPlan,
+      filteringPlan: LogicalPlan): Boolean = {
+    if (canBuildBroadcast) {
+      !hasBenefit || SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly
+    } else {
+      !(hasBenefit && canBroadcastBySize(filteringPlan, SQLConf.get) &&
+        !hintToBroadcastPruningSide && !canBroadcastBySize(pruningPlan, SQLConf.get))

Review comment:
       Yes. To avoid driver OOM.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602035906



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -146,10 +150,18 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We can't reuse the broadcast because the join type doesn't support broadcast,
+      // and doing DPP means running an extra query that may have significant overhead.
+      // We need to make sure the pruning side is very big so that DPP is still worthy.
+      canBroadcastBySize(otherPlan, conf) &&
+        estimatePruningSideSize * conf.dynamicPartitionPruningPruningSideExtraFilterRatio > overhead

Review comment:
       Could you update this function description according to this new configuration, please? For now, we are mentioning only `spark.sql.optimizer.joinFilterRatio`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798199570






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803610725


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40895/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798208482


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40612/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780348226


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135187/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602032483



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -307,6 +307,17 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =
+    buildConf("spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio")
+    .internal()

Review comment:
       Indentation: two more spaces?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570828


   **[Test build #128594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128594/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803599521


   **[Test build #136313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136313/testReport)** for PR 29726 at commit [`1c0f2a0`](https://github.com/apache/spark/commit/1c0f2a046ab98746f45ce85b98b53f30b31481e9).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r600540752



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -146,10 +150,17 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We can't reuse the broadcast because the join type doesn't support broadcast,
+      // and doing DPP means running an extra query that may have significant overhead.
+      // We need to make the pruning side is very big so that DPP is still worthy.

Review comment:
       `make` -> `make sure`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807829270


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-750674816


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781078108


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135214/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781078108


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135214/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807319957


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136527/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084594


   cc @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807932345


   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807293466


   **[Test build #136527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136527/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691084456


   **[Test build #128558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128558/testReport)** for PR 29726 at commit [`2986aad`](https://github.com/apache/spark/commit/2986aad31a98b83397ce216346112e041663d9f4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r600541047



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +151,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We need to make sure that otherPlan can be broadcast by size to avoid driver OOM
+      canBroadcastBySize(otherPlan, conf) && estimatePruningSideSize * 0.04 > overhead

Review comment:
       can we make `0.04` configurable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-806940688


   **[Test build #136527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136527/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807037016


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41111/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691606128


   **[Test build #128594 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128594/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781074426


   **[Test build #135214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135214/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780350692


   **[Test build #135193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135193/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570962






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807861127


   **[Test build #136540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780506569


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135193/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691606352






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum edited a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780266596


   ```scala
   sql("set spark.sql.adaptive.enabled=false")
   sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
   sql("CREATE TABLE t1 using parquet partitioned by (b) AS SELECT id AS a, id % 1000 AS b FROM range(200000000L) distribute by b")
   sql("CREATE TABLE t2 using parquet AS SELECT id as c,id AS d FROM range(2000)")
   sql("SELECT count(*) FROM t2 left join t1 on t1.b=t2.c where t2.d < 10").show
   ```
   Before this pr | After this pr
   -- | --
   ![image](https://user-images.githubusercontent.com/5399861/108152681-8d5e8a80-7114-11eb-9e9d-ca6bb5bb3e29.png)  | ![image](https://user-images.githubusercontent.com/5399861/108150840-59816600-7110-11eb-8b94-c2ad740d125d.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780374411


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39774/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803644066


   **[Test build #136313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136313/testReport)** for PR 29726 at commit [`1c0f2a0`](https://github.com/apache/spark/commit/1c0f2a046ab98746f45ce85b98b53f30b31481e9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r591660490



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -85,22 +86,36 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       filteringKey: Expression,
       filteringPlan: LogicalPlan,
       joinKeys: Seq[Expression],
-      partScan: LogicalRelation): LogicalPlan = {
-    val reuseEnabled = SQLConf.get.exchangeReuseEnabled
+      partScan: LogicalRelation,
+      canBuildBroadcast: Boolean): LogicalPlan = {
     val index = joinKeys.indexOf(filteringKey)
-    lazy val hasBenefit = pruningHasBenefit(pruningKey, partScan, filteringKey, filteringPlan)
-    if (reuseEnabled || hasBenefit) {
-      // insert a DynamicPruning wrapper to identify the subquery during query planning
+    lazy val hasBenefit = pruningHasBenefit(

Review comment:
       This is used to set `onlyInBroadcast` to false. Because `SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly ||! hasBenefit` is always true by default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r487866106



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -196,6 +198,25 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
     case _ => false
   }
 
+  /**
+   * For some filtering side can not broadcast by join type but can broadcast by size,
+   * then we should not consider reuse broadcast only, for example:
+   * Left outer join and left side very small.
+   */
+  private def isReuseBroadcastOnly(
+      canBuildBroadcast: Boolean,
+      hasBenefit: Boolean,
+      hintToBroadcastPruningSide: Boolean,
+      pruningPlan: LogicalPlan,
+      filteringPlan: LogicalPlan): Boolean = {
+    if (canBuildBroadcast) {
+      !hasBenefit || SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly
+    } else {
+      !(hasBenefit && canBroadcastBySize(filteringPlan, SQLConf.get) &&
+        !hintToBroadcastPruningSide && !canBroadcastBySize(pruningPlan, SQLConf.get))

Review comment:
       `!canBroadcastBySize(pruningPlan, SQLConf.get)` do we need this condition?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780270050


   **[Test build #135187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135187/testReport)** for PR 29726 at commit [`e9e50c4`](https://github.com/apache/spark/commit/e9e50c42cbd9444f34ada13aef67d3c1a1bad768).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r595823873



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +151,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We need to make sure that otherPlan can be broadcast by size to avoid driver OOM
+      canBroadcastBySize(otherPlan, conf) && estimatePruningSideSize * 0.04 > overhead

Review comment:
       We should explain the rationale clearly. The problem here is we can't reuse the broadcast (because the join type doesn't support broadcast), and doing DPP means running an extra query that may have significant overhead. We need to make the pruning side is very big so that DPP is still worthy.
   
   BTW how do you come up with 0.04?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803644398


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136313/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807861127


   **[Test build #136540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)** for PR 29726 at commit [`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691085751






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691570962






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780302419


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39768/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690927120






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-690926543


   **[Test build #128558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128558/testReport)** for PR 29726 at commit [`2986aad`](https://github.com/apache/spark/commit/2986aad31a98b83397ce216346112e041663d9f4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781054181


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39795/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-803609317


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40895/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781017325


   **[Test build #135214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135214/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602032154



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -307,6 +307,17 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =

Review comment:
       `PARTITON` -> `PARTITION`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780483240


   **[Test build #135193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135193/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-798208542


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40612/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-781038279


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39795/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780506569


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135193/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-780350692


   **[Test build #135193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135193/testReport)** for PR 29726 at commit [`bc54840`](https://github.com/apache/spark/commit/bc54840259c9f342008dae4bf1295dec34b71865).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
maryannxue commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-792915413


   If I understand correctly (please make it clearer in your PR description), you are trying to enable DPP even if the filter is not broadcast-able but is small enough, I do not think it is a proper solution to hack through this "broadcastOnly" flag. Instead, you should improve the cost model in `pruningHasBenefit` in this PR and change the config `broadcastOnly` to `false` in your own application. And if you can further prove that with the improved cost model, you get performance improvement across the board, you can suggest flipping the default conf value in this PR or another follow-up one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r591671088



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +163,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We need to ensure that the otherPlan can collect to the driver.
+      canBroadcastBySize(otherPlan, SQLConf.get) && estimatePruningSideSize * 0.04 > overhead

Review comment:
       This is to ensure that the pruning side is much lager, after all, we cannot reuse broadcast. For example.
   
   If overhead = 10M
   
     | canBuildBroadcast | canNotBuildBroadcast
   -- | -- | --
   hasBenefit=true | estimatePruningSideSize > 20M | estimatePruningSideSize > 500M
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-691606352






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-792232345


   cc @maryannxue 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on a change in pull request #29726: [SPARK-32855][SQL] Improve DPP for some join type do not support broadcast filtering side

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r589586920



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -147,10 +163,15 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper {
       case _ => fallbackRatio
     }
 
+    val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat
     // the pruning overhead is the total size in bytes of all scan relations
     val overhead = otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
-
-    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
+    if (canBuildBroadcast) {
+      estimatePruningSideSize > overhead
+    } else {
+      // We need to ensure that the otherPlan can collect to the driver.
+      canBroadcastBySize(otherPlan, SQLConf.get) && estimatePruningSideSize * 0.04 > overhead

Review comment:
       what is this `0.04`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org