You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/02 06:25:30 UTC

[GitHub] [spark] sperlingxx opened a new pull request #35708: [SPARK-38386][SQL] Combine compatible scalar subquery

sperlingxx opened a new pull request #35708:
URL: https://github.com/apache/spark/pull/35708


   ### What changes were proposed in this pull request?
   Current PR is to bring up a new optimizer rule `CombineScalarSubquery` to combine compatible scalar subqueries.
   
   The basic idea of  `CombineScalarSubquery`  is combining compatible aggregate plans served for scalar-subquery as much as possible. Aggregates sharing the same "base plan" with other aggregates are considered as compatible. This rule is supposed to combine all groups of compatible aggregates. And create a unified fused aggregate for each group, which can serve multiple scalar subqueries. The "base plan" refers to the top non-project descendant of the Aggregate, which means it could be either the child of Aggregate or the child of Aggregate(Project(...)).
   
   To be specific, `CombineScalarSubquery` consists of three steps:
   1. Collect combination candidates from scalar sub-queries. Assign these candidates into different groups according to the canonicalized base plan.
   2. Combine collected candidates of each group with an unified base plan to build fused aggregates.
   3. Replace ScalarSubqueries with SharedScalarSubqueries based on fused aggregates.
   
   Below is a sample query to illustrate how `CombineScalarSubquery` works as a optimizer rule:
   ```sql
   SELECT SUM(i) FROM t
   WHERE l > (SELECT MIN(l2) FROM t) AND l2 < (SELECT MAX(l) FROM t)
   AND AND i2 <> (SELECT MAX(i2) FROM t) AND i2 <> (SELECT MIN(i2) FROM t)
   ```
   Before the apply of `CombineScalarSubquery`, the optimized logicial plan looks like:
   ```
   Aggregate [sum(i)]
   +- Project [i]
     +- Filter (((l > scalar-subquery#1) AND (l2 < scalar-subquery#2)) AND (NOT (i2 = scalar-subquery#3) AND NOT (i2 = scalar-subquery#4)))
        :  :- Aggregate [min(l2)]
        :  :  +- Project [l2]
        :  :     +- Relation [l,l2,i,i2]
        :  +- Aggregate [max(l)]
        :     +- Project [l]
        :        +- Relation [l,l2,i,i2]
        :  +- Aggregate [max(i2)]
        :     +- Project [l]
        :        +- Relation [l,l2,i,i2]
        :  +- Aggregate [min(i2)]
        :     +- Project [l]
        :        +- Relation [l,l2,i,i2]
        +- Relation [l,l2,i,i2]
   ```
   After the apply of `CombineScalarSubquery`:
   ```
    Aggregate [sum(i)]
    +- Project [i]
      +- Filter (((l > shared-scalar-subquery#1) AND (l2 < shared-scalar-subquery#2)) AND (NOT (i2 = shared-scalar-subquery#3) AND NOT (i2 = shared-scalar-subquery#4)))
         :  :- Aggregate [min(l2),max(l),max(i2),min(i2)]
         :  :  +- Project [l2,l,i2]
         :  :     +- Relation [l,l2,i,i2]
         :  :- Aggregate [min(l2),max(l),max(i2),min(i2)]
         :  :  +- Project [l2,l,i2]
         :        +- Relation [l,l2,i,i2]
         :  :- Aggregate [min(l2),max(l),max(i2),min(i2)]
         :  :  +- Project [l2,l,i2]
         :        +- Relation [l,l2,i,i2]
         :  :- Aggregate [min(l2),max(l),max(i2),min(i2)]
         :  :  +- Project [l2,l,i2]
         :        +- Relation [l,l2,i,i2]
         +- Relation [l,l2,i,i2]
   ```
   In this case, all four sub-queries are compatible. Therefore, they can share the same fused subquery plan. During the preparation of physical plan, these fuesd aggregates will be detected and reused by the rule  `ReuseExchangeAndSubquery`.
   
   
   ### Why are the changes needed?
   To accelerate Spark SQL on queries which are composed of many compatible scalar subqueries, such as: [TPC-DS query9](https://github.com/Agirish/tpcds/blob/master/query9.sql)
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   This patch is verified with unit test. Tests cases are added in `SubquerySuite` and `AdaptiveQueryExec` to make sure thispatch works correctly in terms of logical plan and physical plan (AQE enabled). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #35708: [SPARK-38386][SQL] Combine compatible scalar subquery

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #35708:
URL: https://github.com/apache/spark/pull/35708


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #35708: [SPARK-38386][SQL] Combine compatible scalar subquery

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #35708:
URL: https://github.com/apache/spark/pull/35708#issuecomment-1056619310


   Hi @sperlingxx, this change look very similar to my suggested change: https://github.com/apache/spark/pull/32298.
   Actually, I think your PR is the same as the very first version of my PR was in terms of relying  on `ReuseExchangeAndSubquery` to reuse and run combined subqueries only once. But then my PR evolved into using a special root node `CommonScalarSubqueries` to store the combined subqueries and using `ScalarSubqueryReference` expressions as references to the common ones. (Please see conversation with @sigmod why it was required.)
   Our subquery "merging" logic are a bit different, please check mine as it can handle more complex subqueries. 
   
   cc @cloud-fan 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35708: [SPARK-38386][SQL] Combine compatible scalar subquery

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35708:
URL: https://github.com/apache/spark/pull/35708#issuecomment-1057615935


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sperlingxx commented on pull request #35708: [SPARK-38386][SQL] Combine compatible scalar subquery

Posted by GitBox <gi...@apache.org>.
sperlingxx commented on pull request #35708:
URL: https://github.com/apache/spark/pull/35708#issuecomment-1056660100


   > Hi @sperlingxx, this change look very similar to my suggested change: #32298. Actually, I think your PR is the same as the very first version of my PR was in terms of relying on `ReuseExchangeAndSubquery` to reuse and run combined subqueries only once. But then my PR evolved into using a special root node `CommonScalarSubqueries` to store the combined subqueries and using `ScalarSubqueryReference` expressions as references to the common ones. (Please see conversation with @sigmod why it was required.) Our subquery "merging" logic are a bit different, please check mine as it can handle more complex subqueries.
   > 
   > cc @cloud-fan
   
   Hi @peter-toth, I didn't notice that there exists another ongoing work on the same issue. For sure, compared with this patch, #32298 proposes a more comprehensive apporach to combine scalar subquiries.
   
   I will close this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org