Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/11 04:32:57 UTC

[GitHub] [spark] abellina opened a new pull request, #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

abellina opened a new pull request, #36505:
URL: https://github.com/apache/spark/pull/36505

   ### What changes were proposed in this pull request?
   
   This PR seeks to allow exists (but also NOT exists, IN, and NOT IN) to be rewritten as the appropriate join by `RewritePredicateSubquery` before `InferFiltersFromConstraints` runs. This allows the join conditions to infer filters, such as null filtering that can be pushed down to the scan.
   
   This particular example came about when executing a query derived from TPCDS q16, where an EXISTS is converted into a LeftSemi but no null filtering or push-down filters are generated, forcing Spark to load many null-key rows that will never match (more details, including plans and a simpler repro case, are in the JIRA issue).
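   
   The reasoning behind the inferred filter can be sketched outside Spark. SQL equality is null-rejecting, so a left-side row with a null join key can never survive the LeftSemi, and an `IsNotNull` filter inferred from the join condition can safely prune those rows at the scan. A minimal Python sketch of that argument (illustrative names only, not Spark APIs):

```python
# Toy semi-join: keep left rows whose key matches some right-side key.
# SQL equality is null-rejecting: NULL = x is never TRUE, so rows with a
# NULL key can never survive a LeftSemi and may be filtered out early.

def left_semi(left_rows, right_keys):
    # NULL (None) keys on the right side can never match anything
    matchable = {k for k in right_keys if k is not None}
    return [row for row in left_rows if row["key"] in matchable]

def with_inferred_null_filter(left_rows, right_keys):
    # An inferred IsNotNull(key) filter prunes rows before the join,
    # which is the push-down benefit this PR is after
    pre_filtered = [row for row in left_rows if row["key"] is not None]
    return left_semi(pre_filtered, right_keys)

left = [{"key": 1}, {"key": None}, {"key": 2}, {"key": None}]
right = [1, None, 3]

# Both plans return the same rows; the filtered one joins fewer of them
assert left_semi(left, right) == with_inferred_null_filter(left, right)
```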
   
   I am particularly wary of making changes in the optimizer, as rules depend on the order and build on each other.
   
   I copied (but believe I should move) `RewritePredicateSubquery` ahead of `InferFiltersFromConstraints`, which has the desired effect in my particular test case. I did not bring the whole `RewriteSubquery` batch with it, and that's my first question. Should all of `[RewriteSubquery](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L237)` move further up?
   
   With the new join and filter (and push-down) added earlier, does anyone know if this could cause conflicts with rules that come after? I've spent some time reading through each of the rules, and I am not convinced that the push-down or join-reorder rules will be unaffected.
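   
   The ordering concern can be made concrete with a toy fixed-order rule pipeline (plain Python, not Catalyst): a rule can only act on what earlier rules have produced, so if filter inference runs before the subquery has become a join, there is no join condition to infer from.

```python
# Toy rule pipeline (illustrative, not Catalyst): rule order matters
# because a rule can only act on what earlier rules have produced.

def rewrite_exists_to_semijoin(plan):
    # mimics RewritePredicateSubquery turning EXISTS into a LeftSemi
    if "exists" in plan:
        plan = {"semijoin_on": plan["exists"], "inferred": []}
    return plan

def infer_not_null(plan):
    # mimics InferFiltersFromConstraints: it can only infer IsNotNull
    # filters once a join condition actually exists in the plan
    if "semijoin_on" in plan:
        plan["inferred"] = ["IsNotNull(%s)" % plan["semijoin_on"]]
    return plan

def optimize(plan, rules):
    for rule in rules:
        plan = rule(plan)
    return plan

plan = {"exists": "key", "inferred": []}
good = optimize(dict(plan), [rewrite_exists_to_semijoin, infer_not_null])
bad = optimize(dict(plan), [infer_not_null, rewrite_exists_to_semijoin])
assert good["inferred"] == ["IsNotNull(key)"]
assert bad["inferred"] == []  # too late: nothing was inferred
```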
   
   Any feedback would be appreciated.
   
   ### Why are the changes needed?
   Without this change, the LeftSemi/LeftAnti joins inserted for (NOT) EXISTS or (NOT) IN will not get null-key filtering or push-down benefits.
   
   ### Does this PR introduce _any_ user-facing change?
   No, this is going to make some queries run faster.
   
   ### How was this patch tested?
   - Manual execution of a query based on TPCDS q16
   - A unit test was added to assert that the `FilterExec` is added. I am happy to add more tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

cloud-fan commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1178972459

   I think ultimately we should rewrite correlated subqueries to joins at the very beginning of the optimizer. Also cc @allisonwang-db 




[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

abellina commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1129117318

   Update on the SPARK-32290: SingleColumn Null Aware Anti Join Optimize failure:
   
   - The original test used a table in the subquery, `testData2`, which has no nulls, so I added `testData2WithNulls`, which brings nulls to the `a` column used in the `not in` statement. This helps the first query in this test get planned as a null-aware anti join.
   
   - All other queries in the test work, except for the negative case for the multi-column support. It is commented out in my last patch (obviously that's not the solution) https://github.com/apache/spark/pull/36505/commits/baac1e4119f755b3a906f27c4f7324022fc27e85#diff-4279d0074cd860be3eab329b3716ac502a14ae4512c3e371a5f0e68636cec07dR1163-R1166. I believe this also has to do with nullability, in this case of the second column `b`; I am looking into it, but I could use some help.




[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

abellina commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1129363308

   For the max iterations issue, it looks to be happening during `NestedColumnAliasing`: in the test, columns that extract a nested child (`_extract_name`) keep generating new aliases with new `exprId`s. It only happens when I add my rule, so my change is the cause.




[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

abellina commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1124014672

   Ok, as feared I am causing some issues here. Specifically:
   
   ```
    java.lang.RuntimeException: Max iterations (100) reached for batch Operator Optimization after Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a larger value.
   ```
   
   That seems to indicate that the added join/filter is causing this later batch to reach the fixedPoint limit for a few tests:
   
   - Spark vectorized reader - without partition data column - SPARK-38918: nested schema pruning with correlated subqueries *** FAILED ***
   - Spark vectorized reader - with partition data column - SPARK-38918: nested schema pruning with correlated subqueries *** FAILED ***
   - Non-vectorized reader - without partition data column - SPARK-38918: nested schema pruning with correlated subqueries *** FAILED ***
   - Non-vectorized reader - with partition data column - SPARK-38918: nested schema pruning with correlated subqueries *** FAILED ***
   - Case-insensitive parser - mixed-case schema - subquery filter with different-case column names *** FAILED ***
   
   There are unfortunately other errors as well:
   
   [info] - SPARK-32290: SingleColumn Null Aware Anti Join Optimize *** FAILED *** (15 milliseconds)
   [info]   joinExec.asInstanceOf[org.apache.spark.sql.execution.joins.BroadcastHashJoinExec].isNullAwareAntiJoin was false (JoinSuite.scala:1161)
   
   And then we have a number of logical plans that were expected to match checked-in versions in the repo but have now changed. I am not entirely sure whether it is OK to change those plans.
   
   I think I first need to understand how to actually fix this issue before going off to fix most of the test failures.




[GitHub] [spark] gatorsmile commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

gatorsmile commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1170844167

   cc @sigmod  




[GitHub] [spark] github-actions[bot] commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

github-actions[bot] commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1280110331

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] tgravescs commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

tgravescs commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1123794324

   @cloud-fan, who is more familiar with the optimizer.




[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

abellina commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1129153161

   > All other queries in the test are passing, except for the negative case for the multi-column support. It is commented out in my last patch (obviously that's not the solution) https://github.com/apache/spark/commit/baac1e4119f755b3a906f27c4f7324022fc27e85#diff-4279d0074cd860be3eab329b3716ac502a14ae4512c3e371a5f0e68636cec07dR1163-R1166. I believe this also has to do with the nullability, in this case of the second column b, and I am looking into it, but I could use some help.
   
   Ok, the reason for this is that the condition:
   ```
   (((key = a) OR isnull((key = a))) AND ((key + 1) = b))
   ```
   
   is split by `ExtractEquiJoinKeys` into the equi-join part (`(key + 1) = b`) and a residual condition, `(key = a) OR isnull((key = a))`.
   
   If `b` were nullable, we wind up with a different condition:
   
   ```
   (((key = a) OR isnull((key = a))) AND (((key + 1) = b) OR isnull(((key + 1) = b))))
   ```
   
   This can't be split by `ExtractEquiJoinKeys`, so the join is not treated as an equi-join at this point. It then goes on to `ExtractSingleColumnNullAwareAntiJoin` and passes through, since it can't be matched against the single-column expression that rule expects => this is the negative case for the test. I'll change the test to use a table with nullable key and value columns.
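   
   The `OR isnull(...)` terms come from SQL's three-valued logic for NOT IN: if any comparison with a subquery value is UNKNOWN (because either side is NULL), the anti join must not keep the row. A minimal Python sketch of that semantics (illustrative, not Spark code):

```python
# SQL three-valued logic: comparisons involving NULL yield UNKNOWN,
# modeled here as None.
def sql_eq(a, b):
    if a is None or b is None:
        return None  # UNKNOWN
    return a == b

# For NOT IN, a row survives the anti join only if every equality with
# the subquery values is definitely FALSE. A TRUE *or* UNKNOWN result
# rejects the row, which is why the planner keeps terms like
# (key = a) OR isnull((key = a)) when the columns are nullable.
def anti_join_keeps(row_key, subquery_keys):
    results = [sql_eq(row_key, k) for k in subquery_keys]
    return all(r is False for r in results)

assert anti_join_keeps(5, [1, 2, 3]) is True
assert anti_join_keeps(1, [1, 2, 3]) is False
assert anti_join_keeps(5, [1, None, 3]) is False  # UNKNOWN poisons NOT IN
assert anti_join_keeps(None, [1, 2, 3]) is False
```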




[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

abellina commented on PR #36505:
URL: https://github.com/apache/spark/pull/36505#issuecomment-1123980437

   I have some related test failures here so I'll take a look at that now.




[GitHub] [spark] github-actions[bot] closed pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

github-actions[bot] closed pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred 
URL: https://github.com/apache/spark/pull/36505

