You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/21 19:59:07 UTC

[GitHub] [spark] Swinky opened a new pull request #34062: remove extra DPP filters

Swinky opened a new pull request #34062:
URL: https://github.com/apache/spark/pull/34062


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Don't insert dynamic partition pruning filters in case the filters already referred statically. In case the filtering predicate on dimension table is in joinKey, no need to insert DPP filter in that case.
   
   Sample query:
   ```
   SELECT f.date_id, f.pid, f.sid FROM
   (select date_id, product_id as pid, store_id as sid from fact_stats) as f
   JOIN dim_stats s
   ON f.sid = s.store_id WHERE s.store_id = 3
   ```
   
   Without this PR DPP filter is inserted for above query despite of `store_id#4551 = 3` on fact_stats
   ```
     == Physical Plan ==
     *(2) Project [date_id#4548, pid#4631, sid#4632]
     +- *(2) BroadcastHashJoin [sid#4632], [store_id#4552], Inner, BuildRight, false
        :- *(2) Project [date_id#4548, product_id#4549 AS pid#4631, store_id#4551 AS sid#4632]
        :  +- *(2) ColumnarToRow
        :     +- FileScan parquet default.fact_stats[date_id#4548,product_id#4549,store_id#4551] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/D:/workspace/spark/sql/core/spark-warehouse/org.apache.spark.sql..., PartitionFilters: [(store_id#4551 = 3), isnotnull(store_id#4551), **dynamicpruningexpression(store_id#4551 IN dynamic...,** PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
        :           +- SubqueryBroadcast dynamicpruning#4636, 0, [store_id#4552], [id=#2461]
        :              +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#2460]
        :                 +- *(1) Filter (isnotnull(store_id#4552) AND (store_id#4552 = 3))
        :                    +- *(1) ColumnarToRow
        :                       +- FileScan parquet default.dim_stats[store_id#4552] Batched: true, DataFilters: [isnotnull(store_id#4552), (store_id#4552 = 3)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/D:/workspace/spark/sql/core/spark-warehouse/org.apache.spark.sql..., PartitionFilters: [], PushedFilters: [IsNotNull(store_id), EqualTo(store_id,3)], ReadSchema: struct<store_id:int>
        +- ReusedExchange [store_id#4552], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#2460]
   ```
   
   
   
   ### Why are the changes needed?
   Having redundant dynamic filters can have  unnecessary overheads: extra filtering overhead + in case of DPP subquery case, subquery execution overhead.
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Existing UTs pass. added new test case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925925333


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48061/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927516518


   **[Test build #143638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143638/testReport)** for PR 34062 at commit [`f61f683`](https://github.com/apache/spark/commit/f61f68355ad06fad9ae893b97e84d3d97fe562ac).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925887517


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48061/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924478919


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714259017



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,28 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective. If the filtering predicate is on join
+   * key then partition filter can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
+  private def isLikelySelective(e: Expression, joinKey: Expression): Boolean = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) => isLikelySelective(l, joinKey) || isLikelySelective(r, joinKey)
+    case Or(l, r) => isLikelySelective(l, joinKey) && isLikelySelective(r, joinKey)
+    case expr: StringRegexExpression => true && !expr.references.subsetOf(joinKey.references)
+    case expr: BinaryComparison => true && !expr.references.subsetOf(joinKey.references)
+    case expr: In => true && !expr.references.subsetOf(joinKey.references)
+    case expr: InSet => true && !expr.references.subsetOf(joinKey.references)
+    case expr: StringPredicate => true && !expr.references.subsetOf(joinKey.references)
+    case expr: MultiLikeBase => true && !expr.references.subsetOf(joinKey.references)

Review comment:
       @viirya so is it safe to say predicates defined in this method are the supported ones? org.apache.spark.sql.execution.datasources.DataSourceStrategy#translateLeafNodeFilter
   
   If can just make y change for supported filters.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714135815



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -282,13 +284,13 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
             // otherwise the pruning will not trigger
             var filterableScan = getFilterableTableScan(l, left)
             if (filterableScan.isDefined && canPruneLeft(joinType) &&
-                hasPartitionPruningFilter(right)) {
+                hasPartitionPruningFilter(right, r)) {

Review comment:
       I couldn't understand how can we do this filtering here, this should be done recursively on  all the filters in `right` plan. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927679864


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143638/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927563339


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48150/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-926061861


   **[Test build #143552 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143552/testReport)** for PR 34062 at commit [`9d73a79`](https://github.com/apache/spark/commit/9d73a797e1fae0a08d7fb955f0d74b2855d2854a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925436710


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143517/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925436710


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143517/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925114992


   **[Test build #143509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143509/testReport)** for PR 34062 at commit [`8f32ebb`](https://github.com/apache/spark/commit/8f32ebb4747350cfdc4720e424e8449c399030fb).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: remove extra DPP filters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924340342


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924380687


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47995/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925362263


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143509/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-926077805


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143552/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924799803


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r713398498



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,28 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective. If the filtering predicate is on join
+   * key then partition filter can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
+  private def isLikelySelective(e: Expression, joinKey: Expression): Boolean = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) => isLikelySelective(l, joinKey) || isLikelySelective(r, joinKey)
+    case Or(l, r) => isLikelySelective(l, joinKey) && isLikelySelective(r, joinKey)
+    case expr: StringRegexExpression => true && !expr.references.subsetOf(joinKey.references)

Review comment:
       true seems redundant? just `!expr.references.subsetOf(joinKey.references)`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924348331


   ok to test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925362263


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143509/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714140966



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
##########
@@ -1543,6 +1543,70 @@ abstract class DynamicPartitionPruningSuiteBase
       checkAnswer(df, Row(1150, 1) :: Row(1130, 4) :: Row(1140, 4) :: Nil)
     }
   }
+
+  test("No dynamic partition pruning filters in case of static partition filtering") {
+    // join on single attribute
+      val sqlStr =

Review comment:
       corrected, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924752387


   **[Test build #143504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143504/testReport)** for PR 34062 at commit [`496587e`](https://github.com/apache/spark/commit/496587ef221980f2f3bbc46d8dbf2f328ed91aea).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925157948


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143504/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925925333


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48061/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r713384677



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
##########
@@ -1543,6 +1543,70 @@ abstract class DynamicPartitionPruningSuiteBase
       checkAnswer(df, Row(1150, 1) :: Row(1130, 4) :: Row(1140, 4) :: Nil)
     }
   }
+
+  test("No dynamic partition pruning filters in case of static partition filtering") {

Review comment:
       Could you add `SPARK-36819:` prefix to the test case name?
   ```
   - test("No dynamic partition pruning filters in case of static partition filtering") {
   + test("SPARK-36819: No dynamic partition pruning filters in case of static partition filtering") {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925206067


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924862976


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924419721


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47995/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924749497


   ok to test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714140233



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
##########
@@ -1543,6 +1543,70 @@ abstract class DynamicPartitionPruningSuiteBase
       checkAnswer(df, Row(1150, 1) :: Row(1130, 4) :: Row(1140, 4) :: Nil)
     }
   }
+
+  test("No dynamic partition pruning filters in case of static partition filtering") {

Review comment:
       added, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924752255


   do you mean `InferFiltersFromConstraints` can generate static partition predicates and we don't need to trigger DPP in that case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925435233


   **[Test build #143517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143517/testReport)** for PR 34062 at commit [`ef8a920`](https://github.com/apache/spark/commit/ef8a920e9cebbb0b14a787241b46f19cba46a13c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714259017



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,28 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective. If the filtering predicate is on join
+   * key then partition filter can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
+  private def isLikelySelective(e: Expression, joinKey: Expression): Boolean = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) => isLikelySelective(l, joinKey) || isLikelySelective(r, joinKey)
+    case Or(l, r) => isLikelySelective(l, joinKey) && isLikelySelective(r, joinKey)
+    case expr: StringRegexExpression => true && !expr.references.subsetOf(joinKey.references)
+    case expr: BinaryComparison => true && !expr.references.subsetOf(joinKey.references)
+    case expr: In => true && !expr.references.subsetOf(joinKey.references)
+    case expr: InSet => true && !expr.references.subsetOf(joinKey.references)
+    case expr: StringPredicate => true && !expr.references.subsetOf(joinKey.references)
+    case expr: MultiLikeBase => true && !expr.references.subsetOf(joinKey.references)

Review comment:
       @viirya so is it safe to say predicates defined in this method are the supported ones? org.apache.spark.sql.execution.datasources.DataSourceStrategy#translateLeafNodeFilter
   thanks for pointing out, I can just make y change for supported filters.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714422174



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,46 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective in dynamic partition filtering.
+   * 1. the predicate is selective.
+   * 2. the filtering predicate must not be subset of join key. In case it is, key then partition
+   * filter. can be inferred statically in optimization phase, hence return false.

Review comment:
       Sorry, I got confused here. Do you mean `In case it is, then partition filter can be inferred ...`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927563339


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48150/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924862976


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r780403353



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -216,11 +216,53 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Search a filtering predicate in a given logical plan
+   * Returns whether an expression is likely to be selective in dynamic partition filtering.
+   * 1. the predicate is selective.
+   * 2. the filtering predicate must not be subset of join key. In case it is, then partition
+   * filter can be inferred statically in optimization phase, hence return false.
    */
-  private def hasSelectivePredicate(plan: LogicalPlan): Boolean = {
+  private def isLikelySelectiveWithInferFiltersEnabled(
+      e: Expression,
+      joinKey: Expression): (Boolean, Boolean) = e match {
+    case Not(expr) => isLikelySelectiveWithInferFiltersEnabled(expr, joinKey)
+    case And(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) =
+        isLikelySelectiveWithInferFiltersEnabled(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) =
+        isLikelySelectiveWithInferFiltersEnabled(r, joinKey)
+      (isSelectiveLeft || isSelectiveRight, notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case Or(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) =
+        isLikelySelectiveWithInferFiltersEnabled(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) =
+        isLikelySelectiveWithInferFiltersEnabled(r, joinKey)
+      (isSelectiveLeft && isSelectiveRight && (notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight),
+        notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case expr: StringRegexExpression => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: BinaryComparison => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: In => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: InSet => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: StringPredicate => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: MultiLikeBase => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case _ => (false, false)
+  }
+
+  private def isSubsetOfJoinKey(e: Expression, joinKey: Expression): Boolean =
+      e.references.subsetOf(joinKey.references)
+
+  /**
+   * Search a filtering predicate in a given logical plan.
+   */
+  private def hasSelectivePredicate(plan: LogicalPlan, joinKey: Expression): Boolean = {

Review comment:
       Do we need to make this so complicated? This is one of the last rules applied and by this time all static filters have been implied and pushed down? Can't we simply test if there's any static equality-condition filters on the same partition column that the dynamic filter would apply on?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924752387


   **[Test build #143504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143504/testReport)** for PR 34062 at commit [`496587e`](https://github.com/apache/spark/commit/496587ef221980f2f3bbc46d8dbf2f328ed91aea).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-926077805


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143552/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925306586


   **[Test build #143517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143517/testReport)** for PR 34062 at commit [`ef8a920`](https://github.com/apache/spark/commit/ef8a920e9cebbb0b14a787241b46f19cba46a13c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924355870


   Also, cc @sunchao and @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925925233


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48061/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925149277


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924419752


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47995/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924419752


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47995/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924348936


   **[Test build #143484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143484/testReport)** for PR 34062 at commit [`496587e`](https://github.com/apache/spark/commit/496587ef221980f2f3bbc46d8dbf2f328ed91aea).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925350546


   **[Test build #143509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143509/testReport)** for PR 34062 at commit [`8f32ebb`](https://github.com/apache/spark/commit/8f32ebb4747350cfdc4720e424e8449c399030fb).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925369397


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48027/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714823209



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,46 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective in dynamic partition filtering.
+   * 1. the predicate is selective.
+   * 2. the filtering predicate must not be subset of join key. In case it is, key then partition
+   * filter. can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
-    case _ => false
+  private def isLikelySelective(e: Expression, joinKey: Expression): (Boolean, Boolean) = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) = isLikelySelective(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) = isLikelySelective(r, joinKey)
+      (isSelectiveLeft || isSelectiveRight, notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case Or(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) = isLikelySelective(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) = isLikelySelective(r, joinKey)
+      (isSelectiveLeft && isSelectiveRight && (notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight),
+        notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case expr: StringRegexExpression => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: BinaryComparison => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: In => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: InSet => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: StringPredicate => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: MultiLikeBase => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case _ => (false, false)
   }
+   private def isSubsetOfJoinKey(e: Expression, joinKey: Expression): Boolean = {

Review comment:
       done, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925851325


   **[Test build #143552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143552/testReport)** for PR 34062 at commit [`9d73a79`](https://github.com/apache/spark/commit/9d73a797e1fae0a08d7fb955f0d74b2855d2854a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924340342


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924654596


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714431072



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -282,13 +284,13 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
             // otherwise the pruning will not trigger
             var filterableScan = getFilterableTableScan(l, left)
             if (filterableScan.isDefined && canPruneLeft(joinType) &&
-                hasPartitionPruningFilter(right)) {
+                hasPartitionPruningFilter(right, r)) {

Review comment:
       Oh, nvm, I thought the filter predicate was passed in. It is not the case here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925108409


   > do you mean `InferFiltersFromConstraints` can generate static partition predicates and we don't need to trigger DPP in that case?
   
   @cloud-fan correct, examples below:
   
   dimTable `d` has columns (d1, d2...)
   factTable `f` has columns (f1, f2, f3...) partitioned on f1, f2.
   
   Example 1:
   ```
   
   			join(d1=f1)
   		       /	  \
   	    Filter(d1=100)   FactTable(f)
   	  	    |				
   	        dimTable(d)
   ```
   
   	 PartitionFilters for FactTable: [f1=100, f1 in dpp-subquery] // "f1=100" here is inferred in `InferFiltersFromConstraints`
   	 After Proposed change: [f1=100]
   
   
   Example 2:
   ```
   	 	join(d1=f1, d2=f2)
   		/		\
   	Filter(d1=100)	  FactTable(f)
   	  	|				
   	 dimTable(d)
   ```
   
   	 PartitionFilters for FactTable now: [f1=100, f1 in (d1 values from dpp-subquery1), f2 in (d2 values from dpp-subquery1)] // "f1=100" here is inferred in `InferFiltersFromConstraints`
   	 After Proposed change: [f1=100, f2 in (d2 values from dpp-subquery1)]
   
   
   Example 3:
   ```
   	 			join(d1=f1, d2=f2)
   				/		\
   	        Filter(d1=100 || d3=200)	 FactTable(f)
   	  	            |				
   	               dimTable(d)
   ```
   
   	 PartitionFilters for FactTable now: [f1 in (d1 values from dpp-subquery1), f2 in (d2 values from dpp-subquery1)]
   	 After Proposed change: No change in this case as the filter references in the filter are not a subset of d1 nor it is subset of d3.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925369366


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48027/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927513843


   @viirya @huaxingao @cloud-fan thanks for review, I have addressed the comments. Could you please have another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927532935


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48150/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927558177


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48150/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r713402178



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
##########
@@ -1543,6 +1543,70 @@ abstract class DynamicPartitionPruningSuiteBase
       checkAnswer(df, Row(1150, 1) :: Row(1130, 4) :: Row(1140, 4) :: Nil)
     }
   }
+
+  test("No dynamic partition pruning filters in case of static partition filtering") {
+    // join on single attribute
+      val sqlStr =

Review comment:
       Indent looks wrong.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924844296


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48015/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927679864


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143638/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927516518


   **[Test build #143638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143638/testReport)** for PR 34062 at commit [`f61f683`](https://github.com/apache/spark/commit/f61f68355ad06fad9ae893b97e84d3d97fe562ac).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925150988


   **[Test build #143504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143504/testReport)** for PR 34062 at commit [`496587e`](https://github.com/apache/spark/commit/496587ef221980f2f3bbc46d8dbf2f328ed91aea).
    * This patch **fails from timeout after a configured wait of `500m`**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925306586


   **[Test build #143517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143517/testReport)** for PR 34062 at commit [`ef8a920`](https://github.com/apache/spark/commit/ef8a920e9cebbb0b14a787241b46f19cba46a13c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925114992


   **[Test build #143509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143509/testReport)** for PR 34062 at commit [`8f32ebb`](https://github.com/apache/spark/commit/8f32ebb4747350cfdc4720e424e8449c399030fb).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925222151


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-1006354184


   cc @maryannxue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-1006178596


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925369397


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48027/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925222151


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924492872


   **[Test build #143484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143484/testReport)** for PR 34062 at commit [`496587e`](https://github.com/apache/spark/commit/496587ef221980f2f3bbc46d8dbf2f328ed91aea).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714278505



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,28 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective. If the filtering predicate is on join
+   * key then partition filter can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
+  private def isLikelySelective(e: Expression, joinKey: Expression): Boolean = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) => isLikelySelective(l, joinKey) || isLikelySelective(r, joinKey)
+    case Or(l, r) => isLikelySelective(l, joinKey) && isLikelySelective(r, joinKey)
+    case expr: StringRegexExpression => true && !expr.references.subsetOf(joinKey.references)
+    case expr: BinaryComparison => true && !expr.references.subsetOf(joinKey.references)
+    case expr: In => true && !expr.references.subsetOf(joinKey.references)
+    case expr: InSet => true && !expr.references.subsetOf(joinKey.references)
+    case expr: StringPredicate => true && !expr.references.subsetOf(joinKey.references)
+    case expr: MultiLikeBase => true && !expr.references.subsetOf(joinKey.references)

Review comment:
       `translateLeafNodeFilter` is used for data filtering pushdown. Seems for the static partition pruning, the predicates are not translated by it but evaluated in `FileIndex`, or delegated to `ExternalCatalog` (e.g. `HiveExternalCatalog` will translate to Hive's predicate for pruning).
   
   At `FileIndex`, it is evaluated by Spark, so it is fine.
   
   I think even here some predicates are not supported at `ExternalCatalog`, it should be orthogonal to this issue.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-927664418


   **[Test build #143638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143638/testReport)** for PR 34062 at commit [`f61f683`](https://github.com/apache/spark/commit/f61f68355ad06fad9ae893b97e84d3d97fe562ac).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714422369



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,46 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective in dynamic partition filtering.
+   * 1. the predicate is selective.
+   * 2. the filtering predicate must not be subset of join key. In case it is, key then partition
+   * filter. can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
-    case _ => false
+  private def isLikelySelective(e: Expression, joinKey: Expression): (Boolean, Boolean) = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) = isLikelySelective(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) = isLikelySelective(r, joinKey)
+      (isSelectiveLeft || isSelectiveRight, notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case Or(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) = isLikelySelective(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) = isLikelySelective(r, joinKey)
+      (isSelectiveLeft && isSelectiveRight && (notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight),
+        notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case expr: StringRegexExpression => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: BinaryComparison => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: In => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: InSet => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: StringPredicate => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: MultiLikeBase => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case _ => (false, false)
   }
+   private def isSubsetOfJoinKey(e: Expression, joinKey: Expression): Boolean = {

Review comment:
       nit: add an empty line in between of the two methods? Also, is that a three space indent? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714823618



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,46 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective in dynamic partition filtering.
+   * 1. the predicate is selective.
+   * 2. the filtering predicate must not be subset of join key. In case it is, key then partition
+   * filter. can be inferred statically in optimization phase, hence return false.

Review comment:
       corrected




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r780403353



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -216,11 +216,53 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Search a filtering predicate in a given logical plan
+   * Returns whether an expression is likely to be selective in dynamic partition filtering.
+   * 1. the predicate is selective.
+   * 2. the filtering predicate must not be subset of join key. In case it is, then partition
+   * filter can be inferred statically in optimization phase, hence return false.
    */
-  private def hasSelectivePredicate(plan: LogicalPlan): Boolean = {
+  private def isLikelySelectiveWithInferFiltersEnabled(
+      e: Expression,
+      joinKey: Expression): (Boolean, Boolean) = e match {
+    case Not(expr) => isLikelySelectiveWithInferFiltersEnabled(expr, joinKey)
+    case And(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) =
+        isLikelySelectiveWithInferFiltersEnabled(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) =
+        isLikelySelectiveWithInferFiltersEnabled(r, joinKey)
+      (isSelectiveLeft || isSelectiveRight, notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case Or(l, r) =>
+      val (isSelectiveLeft, notSubsetOfJoinKeyLeft) =
+        isLikelySelectiveWithInferFiltersEnabled(l, joinKey)
+      val (isSelectiveRight, notSubsetOfJoinKeyRight) =
+        isLikelySelectiveWithInferFiltersEnabled(r, joinKey)
+      (isSelectiveLeft && isSelectiveRight && (notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight),
+        notSubsetOfJoinKeyLeft || notSubsetOfJoinKeyRight)
+    case expr: StringRegexExpression => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: BinaryComparison => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: In => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: InSet => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: StringPredicate => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case expr: MultiLikeBase => (true, !isSubsetOfJoinKey(expr, joinKey))
+    case _ => (false, false)
+  }
+
+  private def isSubsetOfJoinKey(e: Expression, joinKey: Expression): Boolean =
+      e.references.subsetOf(joinKey.references)
+
+  /**
+   * Search a filtering predicate in a given logical plan.
+   */
+  private def hasSelectivePredicate(plan: LogicalPlan, joinKey: Expression): Boolean = {

Review comment:
       Do we need to make this so complicated? This is one of the last rules applied and by this time all static filters have been implied and pushed down? Can't we simply test if there's any static equality-condition filters on the same partition column that the dynamic filter would apply on?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924478919






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924654596


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924348936


   **[Test build #143484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143484/testReport)** for PR 34062 at commit [`496587e`](https://github.com/apache/spark/commit/496587ef221980f2f3bbc46d8dbf2f328ed91aea).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r713632359



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -282,13 +284,13 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
             // otherwise the pruning will not trigger
             var filterableScan = getFilterableTableScan(l, left)
             if (filterableScan.isDefined && canPruneLeft(joinType) &&
-                hasPartitionPruningFilter(right)) {
+                hasPartitionPruningFilter(right, r)) {

Review comment:
       Can we filter out the predicates which use references on joining key when passing into `hasPartitionPruningFilter`? Then we don't need to change `hasPartitionPruningFilter`, `hasSelectivePredicate`, etc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Swinky commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
Swinky commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r714135815



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -282,13 +284,13 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
             // otherwise the pruning will not trigger
             var filterableScan = getFilterableTableScan(l, left)
             if (filterableScan.isDefined && canPruneLeft(joinType) &&
-                hasPartitionPruningFilter(right)) {
+                hasPartitionPruningFilter(right, r)) {

Review comment:
       I couldn't understand how can we do this filtering here, this should be done recursively on  all the filters in `right` plan, hence this is checked in `hasSelectivePredicate`. Could you please elaborate in case I am missing something.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925157948


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143504/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925851325


   **[Test build #143552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143552/testReport)** for PR 34062 at commit [`9d73a79`](https://github.com/apache/spark/commit/9d73a797e1fae0a08d7fb955f0d74b2855d2854a).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924348999


   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34062:
URL: https://github.com/apache/spark/pull/34062#discussion_r713434031



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -201,26 +201,28 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join
   }
 
   /**
-   * Returns whether an expression is likely to be selective
+   * Returns whether an expression is likely to be selective. If the filtering predicate is on join
+   * key then partition filter can be inferred statically in optimization phase, hence return false.
    */
-  private def isLikelySelective(e: Expression): Boolean = e match {
-    case Not(expr) => isLikelySelective(expr)
-    case And(l, r) => isLikelySelective(l) || isLikelySelective(r)
-    case Or(l, r) => isLikelySelective(l) && isLikelySelective(r)
-    case _: StringRegexExpression => true
-    case _: BinaryComparison => true
-    case _: In | _: InSet => true
-    case _: StringPredicate => true
-    case _: MultiLikeBase => true
+  private def isLikelySelective(e: Expression, joinKey: Expression): Boolean = e match {
+    case Not(expr) => isLikelySelective(expr, joinKey)
+    case And(l, r) => isLikelySelective(l, joinKey) || isLikelySelective(r, joinKey)
+    case Or(l, r) => isLikelySelective(l, joinKey) && isLikelySelective(r, joinKey)
+    case expr: StringRegexExpression => true && !expr.references.subsetOf(joinKey.references)
+    case expr: BinaryComparison => true && !expr.references.subsetOf(joinKey.references)
+    case expr: In => true && !expr.references.subsetOf(joinKey.references)
+    case expr: InSet => true && !expr.references.subsetOf(joinKey.references)
+    case expr: StringPredicate => true && !expr.references.subsetOf(joinKey.references)
+    case expr: MultiLikeBase => true && !expr.references.subsetOf(joinKey.references)

Review comment:
       I only was little concerning about if all these predicates can be supported by the static partition pruning path. but yea by no means this should be duplicated with static path.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-925343136


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48027/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34062: [SPARK-36819][SQL] Don't insert redundant filters in case static partition pruning can be done

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34062:
URL: https://github.com/apache/spark/pull/34062#issuecomment-924497219


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143484/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org