You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "mskapilks (via GitHub)" <gi...@apache.org> on 2023/03/03 07:16:00 UTC

[GitHub] [spark] mskapilks opened a new pull request, #40266: [SPARK-42660] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

mskapilks opened a new pull request, #40266:
URL: https://github.com/apache/spark/pull/40266

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   ### What changes were proposed in this pull request?
   We should run `InferFiltersFromConstraints` again after running `RewritePredicateSubquery` rule. `RewritePredicateSubquery `rewrite IN and EXISTS queries to LEFT SEMI/LEFT ANTI joins. But we don't infer filters for these newly generated joins. We noticed in TPCH 1TB q21 by inferring filter for these new joins, one `lineitem` table scan can be reduced as `ReusedExchange` got introduce. Previously due to mismatch in filter predicates reuse was not happening.
   
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   Can improve query performance.
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   PlanStability test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1144333202


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   These two queries don't need NWAJ now due to more inferred filters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1453097749

   TPCH q21 plan change
   |Before|After|
   |------|----|
   |![Web capture_3-3-2023_125513_ms web azuresynapse net_crop](https://user-images.githubusercontent.com/106726387/222658668-830d9ddb-2cc2-4013-8dbc-29967d957954.jpeg)|![Web capture_3-3-2023_12324_ms web azuresynapse net_crop](https://user-images.githubusercontent.com/106726387/222658801-f917a4cd-f825-4567-aca8-cb596dd64e51.jpeg)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1453149163

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145715046


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   Plan for this query before this change:
   
   ```
   Join LeftAnti, ((key#13 = a#23) OR isnull((key#13 = a#23)))
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- ExternalRDD [obj#12]
   +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23]
      +- ExternalRDD [obj#22]
   ```
   
   New plan
   ```
   Join LeftAnti, (key#13 = a#23)
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- ExternalRDD [obj#12]
   +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23]
      +- ExternalRDD [obj#22]
   ```
   
   `isnull((key#13 = a#23))` condition got removed by `NullPropagation` rule (as now all optimization rules will run after subquery rewrite).
   
   So now the join does get convert to Null Aware Anti Join as that's only happens when condition like previous plan exists. `LeftAnti(condition: Or(EqualTo(a=b), IsNull(EqualTo(a=b)))`  [Code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L403)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145908403


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   Hm, then I think we need to fix the test query (and not the expected result) as `not in` can't be rewritten to a simple (not null-aware) `BroadcastHashJoinExec` if we don't know the `key`'s and `a`'s nullability. I think the problem here is that we use `TestData` and `TestData2` where `key` and `a` are `Int`s and not `Integer`s.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1469424781

   @wangyum @peter-toth Thanks for pointing on previous attempts.
   
   It does seem moving `RewritePredicateSubquery` rule is right way so that in future we don't add anymore rule to that batch (`RewriteSubquery`).
   
   In this pr https://github.com/apache/spark/pull/17520, they tried to put RewritePredicateSubquery right after `Subquery` batch (of `OptimizeSubqueries`). `operatorOptimizationBatch` will run after this.  They also added one rule to push LeftSemi/LeftAnti through join, but that has been added in 3.0 by [SPARK-19712](https://issues.apache.org/jira/browse/SPARK-19712). So now we only need to change rule position.
   
   If this seems right to you guys, I can update this PR to move `RewritePredicateSubquery` after `Subqury` batch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1144487954


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   Can you please ellaborate on this a bit more?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1517796434

   @mskapilks, do you have any update on this? I can to take over this PR and investigate the idea further if you don't have time for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145917715


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   `Filter (46)`'s condition is `exists#2 OR exists#1`, that can't be pushed down. But that's ok as it is basically the same as the old `Filter (30)` was.
   In the new plan the order of joins are a bit different, but I'm not sure the new plan would be worse. Actually we have 3 SMJ + 5 BHJ now whereas we had 4 + 4...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145744109


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   Seems `PushLeftSemiLeftAntiThroughJoin` `PushDownLeftSemiAntiJoin` doesn't consider ExistenceJoin. Might need to update these rules or do predicate pushdown before subquery rewrite (this may not be ideal)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1461267270

   cc: @wangyum @peter-toth 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145946019


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   Can you please run a TPCDS benchmark to make sure we don't introduce performance regression?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1462138771

   This change makes sense to me and new plans look ok to me.
   However, seemingly `InferFiltersFromConstraints` has a dedicated place in the optimizer and so there are 2 special batches `Operator Optimization before Inferring Filters` and `Operator Optimization after Inferring Filters` before and after the rule to make sure the inferred filtes are optimized. It also seems like the `RewriteSubquery` batch slowly becomes larger and larger with rules from those batches (see SPARK-39511, SPARK-22662, SPARK-36280). And now you want to add `InferFiltersFromConstraints` too. So I wonder if `RewritePredicateSubquery` is at the right place or what else would make sense to be executed after `RewritePredicateSubquery`? Maybe rerunning a full `operatorOptimizationBatch` would make sense despite it comes with a cost?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1657307928

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1476205668

   Looks like there are a few failures after moving the rule (https://github.com/apache/spark/pull/40266/commits/22e7886ff1059b98d1525380b2cb22718fd5dd09). @mskapilks, do you think you can look into those failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1480845859

   More failures. Seems this might take real effort to make it work like other rules modifications.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1463196395

   I had a change like this before: https://github.com/apache/spark/pull/22778.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
URL: https://github.com/apache/spark/pull/40266


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1463418560

   > I had a change like this before: https://github.com/apache/spark/pull/22778.
   
   Ah ok, thanks @wangyum! It looks like the very same discussuion has come up before: https://github.com/apache/spark/pull/22778#discussion_r229178084


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1477245713

   > Looks like there are a few failures after moving the rule ([22e7886](https://github.com/apache/spark/commit/22e7886ff1059b98d1525380b2cb22718fd5dd09)). @mskapilks, do you think you can look into those failures?
   
   Yup I am working on them. I had wrong SPARK_HOME setup so missed the plan changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.
mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145720537


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   Seems pushdown is not happening? Need to check this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1481013209

   > More failures. Seems this might take real effort to make it work like other rules modifications.
   
   Why was the latest commit (https://github.com/apache/spark/pull/40266/commits/b1ed7bee8b91faf8286963c7867eeed9f781b487) needed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org