You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "mskapilks (via GitHub)" <gi...@apache.org> on 2023/03/03 07:16:00 UTC

[GitHub] [spark] mskapilks opened a new pull request, #40266: [SPARK-42660] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

mskapilks opened a new pull request, #40266:
URL: https://github.com/apache/spark/pull/40266

### What changes were proposed in this pull request?
We should run `InferFiltersFromConstraints` again after running `RewritePredicateSubquery` rule. `RewritePredicateSubquery `rewrite IN and EXISTS queries to LEFT SEMI/LEFT ANTI joins. But we don't infer filters for these newly generated joins. We noticed in TPCH 1TB q21 by inferring filter for these new joins, one `lineitem` table scan can be reduced as `ReusedExchange` got introduce. Previously due to mismatch in filter predicates reuse was not happening.

### Why are the changes needed?
Can improve query performance.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PlanStability test

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1144333202


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   These two queries don't need NWAJ now due to more inferred filters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1453097749

   TPCH q21 plan change
   |Before|After|
   |------|----|
   |![Web capture_3-3-2023_125513_ms web azuresynapse net_crop](https://user-images.githubusercontent.com/106726387/222658668-830d9ddb-2cc2-4013-8dbc-29967d957954.jpeg)|![Web capture_3-3-2023_12324_ms web azuresynapse net_crop](https://user-images.githubusercontent.com/106726387/222658801-f917a4cd-f825-4567-aca8-cb596dd64e51.jpeg)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1453149163

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145715046


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   Plan for this query before this change:
   
   ```
   Join LeftAnti, ((key#13 = a#23) OR isnull((key#13 = a#23)))
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- ExternalRDD [obj#12]
   +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23]
      +- ExternalRDD [obj#22]
   ```
   
   New plan
   ```
   Join LeftAnti, (key#13 = a#23)
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- ExternalRDD [obj#12]
   +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23]
      +- ExternalRDD [obj#22]
   ```
   
   `isnull((key#13 = a#23))` condition got removed by `NullPropagation` rule (as now all optimization rules will run after subquery rewrite).
   
   So now the join does get convert to Null Aware Anti Join as that's only happens when condition like previous plan exists. `LeftAnti(condition: Or(EqualTo(a=b), IsNull(EqualTo(a=b)))`  [Code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L403)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145908403


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   Hm, then I think we need to fix the test query (and not the expected result) as `not in` can't be rewritten to a simple (not null-aware) `BroadcastHashJoinExec` if we don't know the `key`'s and `a`'s nullability. I think the problem here is that we use `TestData` and `TestData2` where `key` and `a` are `Int`s and not `Integer`s.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1469424781

   @wangyum @peter-toth Thanks for pointing on previous attempts.
   
   It does seem moving `RewritePredicateSubquery` rule is right way so that in future we don't add anymore rule to that batch (`RewriteSubquery`).
   
   In this pr https://github.com/apache/spark/pull/17520, they tried to put RewritePredicateSubquery right after `Subquery` batch (of `OptimizeSubqueries`). `operatorOptimizationBatch` will run after this.  They also added one rule to push LeftSemi/LeftAnti through join, but that has been added in 3.0 by [SPARK-19712](https://issues.apache.org/jira/browse/SPARK-19712). So now we only need to change rule position.
   
   If this seems right to you guys, I can update this PR to move `RewritePredicateSubquery` after `Subqury` batch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1144487954


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       var joinExec = assertJoin((
         "select * from testData where key not in (select a from testData2)",
         classOf[BroadcastHashJoinExec]))
-      assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+      assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)

Review Comment:
   Can you please ellaborate on this a bit more?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1517796434

   @mskapilks, do you have any update on this? I can to take over this PR and investigate the idea further if you don't have time for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145917715


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   `Filter (46)`'s condition is `exists#2 OR exists#1`, that can't be pushed down. But that's ok as it is basically the same as the old `Filter (30)` was.
   In the new plan the order of joins are a bit different, but I'm not sure the new plan would be worse. Actually we have 3 SMJ + 5 BHJ now whereas we had 4 + 4...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145744109


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   Seems `PushLeftSemiLeftAntiThroughJoin` `PushDownLeftSemiAntiJoin` doesn't consider ExistenceJoin. Might need to update these rules or do predicate pushdown before subquery rewrite (this may not be ideal)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1461267270

   cc: @wangyum @peter-toth 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145946019


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   Can you please run a TPCDS benchmark to make sure we don't introduce performance regression?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1462138771

   This change makes sense to me and new plans look ok to me.
   However, seemingly `InferFiltersFromConstraints` has a dedicated place in the optimizer and so there are 2 special batches `Operator Optimization before Inferring Filters` and `Operator Optimization after Inferring Filters` before and after the rule to make sure the inferred filtes are optimized. It also seems like the `RewriteSubquery` batch slowly becomes larger and larger with rules from those batches (see SPARK-39511, SPARK-22662, SPARK-36280). And now you want to add `InferFiltersFromConstraints` too. So I wonder if `RewritePredicateSubquery` is at the right place or what else would make sense to be executed after `RewritePredicateSubquery`? Maybe rerunning a full `operatorOptimizationBatch` would make sense despite it comes with a cost?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1657307928

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1476205668

   Looks like there are a few failures after moving the rule (https://github.com/apache/spark/pull/40266/commits/22e7886ff1059b98d1525380b2cb22718fd5dd09). @mskapilks, do you think you can look into those failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1480845859

   More failures. Seems this might take real effort to make it work like other rules modifications.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "wangyum (via GitHub)" <gi...@apache.org>.

wangyum commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1463196395

   I had a change like this before: https://github.com/apache/spark/pull/22778.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
URL: https://github.com/apache/spark/pull/40266


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1463418560

   > I had a change like this before: https://github.com/apache/spark/pull/22778.
   
   Ah ok, thanks @wangyum! It looks like the very same discussuion has come up before: https://github.com/apache/spark/pull/22778#discussion_r229178084


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1477245713

   > Looks like there are a few failures after moving the rule ([22e7886](https://github.com/apache/spark/commit/22e7886ff1059b98d1525380b2cb22718fd5dd09)). @mskapilks, do you think you can look into those failures?
   
   Yup I am working on them. I had wrong SPARK_HOME setup so missed the plan changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mskapilks commented on a diff in pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "mskapilks (via GitHub)" <gi...@apache.org>.

mskapilks commented on code in PR #40266:
URL: https://github.com/apache/spark/pull/40266#discussion_r1145720537


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q10.sf100/explain.txt:
##########
@@ -4,67 +4,67 @@ TakeOrderedAndProject (51)
    +- Exchange (49)
       +- * HashAggregate (48)
          +- * Project (47)
-            +- * SortMergeJoin Inner (46)
-               :- * Sort (40)
-               :  +- Exchange (39)
-               :     +- * Project (38)
-               :        +- * BroadcastHashJoin Inner BuildRight (37)
-               :           :- * Project (31)
-               :           :  +- * Filter (30)
-               :           :     +- * SortMergeJoin ExistenceJoin(exists#1) (29)
-               :           :        :- * SortMergeJoin ExistenceJoin(exists#2) (21)
-               :           :        :  :- * SortMergeJoin LeftSemi (13)
-               :           :        :  :  :- * Sort (5)
-               :           :        :  :  :  +- Exchange (4)
-               :           :        :  :  :     +- * Filter (3)
-               :           :        :  :  :        +- * ColumnarToRow (2)
-               :           :        :  :  :           +- Scan parquet spark_catalog.default.customer (1)
-               :           :        :  :  +- * Sort (12)
-               :           :        :  :     +- Exchange (11)
-               :           :        :  :        +- * Project (10)
-               :           :        :  :           +- * BroadcastHashJoin Inner BuildRight (9)
-               :           :        :  :              :- * ColumnarToRow (7)
-               :           :        :  :              :  +- Scan parquet spark_catalog.default.store_sales (6)
-               :           :        :  :              +- ReusedExchange (8)
-               :           :        :  +- * Sort (20)
-               :           :        :     +- Exchange (19)
-               :           :        :        +- * Project (18)
-               :           :        :           +- * BroadcastHashJoin Inner BuildRight (17)
-               :           :        :              :- * ColumnarToRow (15)
-               :           :        :              :  +- Scan parquet spark_catalog.default.web_sales (14)
-               :           :        :              +- ReusedExchange (16)
-               :           :        +- * Sort (28)
-               :           :           +- Exchange (27)
-               :           :              +- * Project (26)
-               :           :                 +- * BroadcastHashJoin Inner BuildRight (25)
-               :           :                    :- * ColumnarToRow (23)
-               :           :                    :  +- Scan parquet spark_catalog.default.catalog_sales (22)
-               :           :                    +- ReusedExchange (24)
-               :           +- BroadcastExchange (36)
-               :              +- * Project (35)
-               :                 +- * Filter (34)
-               :                    +- * ColumnarToRow (33)
-               :                       +- Scan parquet spark_catalog.default.customer_address (32)
-               +- * Sort (45)
-                  +- Exchange (44)
-                     +- * Filter (43)
-                        +- * ColumnarToRow (42)
-                           +- Scan parquet spark_catalog.default.customer_demographics (41)
+            +- * Filter (46)
+               +- * SortMergeJoin ExistenceJoin(exists#1) (45)

Review Comment:
   Seems pushdown is not happening? Need to check this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1481013209

   > More failures. Seems this might take real effort to make it work like other rules modifications.
   
   Why was the latest commit (https://github.com/apache/spark/pull/40266/commits/b1ed7bee8b91faf8286963c7867eeed9f781b487) needed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org