Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/18 10:52:56 UTC

[GitHub] [spark] wangyum opened a new pull request, #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

wangyum opened a new pull request, #36595:
URL: https://github.com/apache/spark/pull/36595

   ### What changes were proposed in this pull request?
   
   Make `CombineUnions` skip combining unions if the project contains subqueries. For example:
   ```sql
   SELECT (SELECT IF(x, 1, 0)) AS a
   FROM (SELECT true) t(x)
   UNION
   SELECT 1 AS a
   ```
   
   Without this change, the query above throws an exception:
   ```
   java.lang.IllegalStateException: Couldn't find x#4 in [] 
   ```
   
   ### Why are the changes needed?
   
   To fix the bug shown above, where the query fails with an `IllegalStateException`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36595:
URL: https://github.com/apache/spark/pull/36595#discussion_r875775535


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1372,15 +1374,16 @@ object CombineUnions extends Rule[LogicalPlan] {
         // Push down projection through Union and then push pushed plan to Stack if
         // there is a Project.
         case Project(projectList, Distinct(u @ Union(children, byName, allowMissingCol)))
-            if projectList.forall(_.deterministic) && children.nonEmpty &&
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && children.nonEmpty &&

Review Comment:
   fixed





[GitHub] [spark] dongjoon-hyun commented on pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36595:
URL: https://github.com/apache/spark/pull/36595#issuecomment-1131203934

   Could you update the PR description too, @wangyum ?




[GitHub] [spark] cloud-fan commented on pull request #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #36595:
URL: https://github.com/apache/spark/pull/36595#issuecomment-1129946923

   @allisonwang-db 




[GitHub] [spark] cloud-fan commented on a diff in pull request #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36595:
URL: https://github.com/apache/spark/pull/36595#discussion_r875761064


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1372,15 +1374,16 @@ object CombineUnions extends Rule[LogicalPlan] {
         // Push down projection through Union and then push pushed plan to Stack if
         // there is a Project.
         case Project(projectList, Distinct(u @ Union(children, byName, allowMissingCol)))
-            if projectList.forall(_.deterministic) && children.nonEmpty &&
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && children.nonEmpty &&
               flattenDistinct && byName == topByName && allowMissingCol == topAllowMissingCol =>
           stack.pushAll(pushProjectionThroughUnion(projectList, u).reverse)
         case Project(projectList, Deduplicate(keys: Seq[Attribute], u: Union))
-            if projectList.forall(_.deterministic) && flattenDistinct && u.byName == topByName &&
-              u.allowMissingCol == topAllowMissingCol && AttributeSet(keys) == u.outputSet =>
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && flattenDistinct &&

Review Comment:
   ditto



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1372,15 +1374,16 @@ object CombineUnions extends Rule[LogicalPlan] {
         // Push down projection through Union and then push pushed plan to Stack if
         // there is a Project.
         case Project(projectList, Distinct(u @ Union(children, byName, allowMissingCol)))
-            if projectList.forall(_.deterministic) && children.nonEmpty &&
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && children.nonEmpty &&
               flattenDistinct && byName == topByName && allowMissingCol == topAllowMissingCol =>
           stack.pushAll(pushProjectionThroughUnion(projectList, u).reverse)
         case Project(projectList, Deduplicate(keys: Seq[Attribute], u: Union))
-            if projectList.forall(_.deterministic) && flattenDistinct && u.byName == topByName &&
-              u.allowMissingCol == topAllowMissingCol && AttributeSet(keys) == u.outputSet =>
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && flattenDistinct &&
+              u.byName == topByName && u.allowMissingCol == topAllowMissingCol &&
+              AttributeSet(keys) == u.outputSet =>
           stack.pushAll(pushProjectionThroughUnion(projectList, u).reverse)
         case Project(projectList, u @ Union(children, byName, allowMissingCol))
-            if projectList.forall(_.deterministic) && children.nonEmpty &&
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && children.nonEmpty &&

Review Comment:
   ditto





[GitHub] [spark] dongjoon-hyun closed pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery
URL: https://github.com/apache/spark/pull/36595




[GitHub] [spark] wangyum commented on a diff in pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36595:
URL: https://github.com/apache/spark/pull/36595#discussion_r876577966


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1355,7 +1356,8 @@ object CombineUnions extends Rule[LogicalPlan] {
     while (stack.nonEmpty) {
       stack.pop() match {
         case p1 @ Project(_, p2: Project)
-            if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) =>
+            if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) &&
+              !p1.projectList.exists(hasSubquery) && !p2.projectList.exists(hasSubquery) =>

Review Comment:
   +1





[GitHub] [spark] cloud-fan commented on a diff in pull request #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36595:
URL: https://github.com/apache/spark/pull/36595#discussion_r875760859


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1372,15 +1374,16 @@ object CombineUnions extends Rule[LogicalPlan] {
         // Push down projection through Union and then push pushed plan to Stack if
         // there is a Project.
         case Project(projectList, Distinct(u @ Union(children, byName, allowMissingCol)))
-            if projectList.forall(_.deterministic) && children.nonEmpty &&
+            if projectList.forall(e => e.deterministic && !hasSubquery(e)) && children.nonEmpty &&

Review Comment:
   I don't think we need to check for subqueries here. The code in Spark 3.2 can also push down a project through a union, even if the project contains subqueries.





[GitHub] [spark] wangyum commented on pull request #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36595:
URL: https://github.com/apache/spark/pull/36595#issuecomment-1129862490

   cc @cloud-fan 




[GitHub] [spark] dongjoon-hyun commented on pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #36595:
URL: https://github.com/apache/spark/pull/36595#issuecomment-1131290098

   Merged to master/3.3.
   
   cc @MaxGekk 




[GitHub] [spark] allisonwang-db commented on a diff in pull request #36595: [SPARK-39216][SQL] Do not combine unions if project contains subqueries

Posted by GitBox <gi...@apache.org>.
allisonwang-db commented on code in PR #36595:
URL: https://github.com/apache/spark/pull/36595#discussion_r876426769


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1355,7 +1356,8 @@ object CombineUnions extends Rule[LogicalPlan] {
     while (stack.nonEmpty) {
       stack.pop() match {
         case p1 @ Project(_, p2: Project)
-            if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) =>
+            if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) &&
+              !p1.projectList.exists(hasSubquery) && !p2.projectList.exists(hasSubquery) =>

Review Comment:
   `hasSubquery` -> `hasCorrelatedSubquery`. I think non-correlated subqueries should work but let's add a test case just to be sure.
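
The difference between the two predicates can be illustrated with a toy model. This is a minimal sketch, not Spark's Catalyst code: the `Expr` types, `children`, `exists`, and `mayCollapse` are hypothetical stand-ins, and only the predicate names `hasSubquery` and `hasCorrelatedSubquery` come from the diff and the comment above. It assumes a correlated subquery is one whose body reads at least one attribute of the enclosing query.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's Catalyst classes.
object SubqueryGuardSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  // outerRefs lists the attributes of the enclosing query that the subquery body reads.
  final case class ScalarSubquery(description: String, outerRefs: Seq[Attr]) extends Expr

  // Direct children of an expression node in this toy tree.
  def children(e: Expr): Seq[Expr] = e match {
    case Add(l, r)   => Seq(l, r)
    case Alias(c, _) => Seq(c)
    case _           => Nil
  }

  // True if the predicate holds for the node itself or for any descendant.
  def exists(e: Expr)(p: Expr => Boolean): Boolean =
    p(e) || children(e).exists(c => exists(c)(p))

  // Matches any subquery, correlated or not.
  def hasSubquery(e: Expr): Boolean = exists(e) {
    case _: ScalarSubquery => true
    case _                 => false
  }

  // Matches only subqueries that reference the enclosing query.
  def hasCorrelatedSubquery(e: Expr): Boolean = exists(e) {
    case s: ScalarSubquery => s.outerRefs.nonEmpty
    case _                 => false
  }

  // Shape of the extra guard: refuse to merge two adjacent project lists
  // when either one contains a correlated subquery.
  def mayCollapse(upper: Seq[Expr], lower: Seq[Expr]): Boolean =
    !upper.exists(hasCorrelatedSubquery) && !lower.exists(hasCorrelatedSubquery)

  // SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true) t(x)
  val correlatedUpper: Seq[Expr] =
    Seq(Alias(ScalarSubquery("IF(x, 1, 0)", Seq(Attr("x"))), "a"))
  val correlatedLower: Seq[Expr] = Seq(Alias(Lit(true), "x"))

  // SELECT x + 1 FROM (SELECT id + (SELECT max(id) FROM range(2)) AS x FROM range(1)) t
  val uncorrelatedUpper: Seq[Expr] = Seq(Alias(Add(Attr("x"), Lit(1)), "(x + 1)"))
  val uncorrelatedLower: Seq[Expr] =
    Seq(Alias(Add(Attr("id"), ScalarSubquery("max(id) over range(2)", Nil)), "x"))

  def main(args: Array[String]): Unit = {
    println(mayCollapse(correlatedUpper, correlatedLower))     // false
    println(mayCollapse(uncorrelatedUpper, uncorrelatedLower)) // true
  }
}
```

In this model a guard based on `hasSubquery` would block both shapes, while one based on `hasCorrelatedSubquery` blocks only the first and still lets the non-correlated case collapse.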





[GitHub] [spark] wangyum commented on a diff in pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #36595:
URL: https://github.com/apache/spark/pull/36595#discussion_r876577315


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -4427,6 +4427,31 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
         ))
     }
   }
+
+  test("SPARK-39216: Don't collapse projects in CombineUnions if it hasCorrelatedSubquery") {
+    checkAnswer(
+      sql(
+        """
+          |SELECT (SELECT IF(x, 1, 0)) AS a
+          |FROM (SELECT true) t(x)
+          |UNION
+          |SELECT 1 AS a
+        """.stripMargin),
+      Seq(Row(1)))
+
+    checkAnswer(
+      sql(
+        """
+          |SELECT x + 1
+          |FROM   (SELECT id
+          |               + (SELECT Max(id)
+          |                  FROM   range(2)) AS x
+          |        FROM   range(1)) t
+          |UNION
+          |SELECT 1 AS a
+        """.stripMargin),
+      Seq(Row(2), Row(1)))

Review Comment:
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineUnions ===
    Distinct                                                          Distinct
    +- Union false, false                                             +- Union false, false
   !   :- Project [(x#219L + cast(1 as bigint)) AS (x + 1)#225L]         :- Project [((id#221L + scalar-subquery#218 []) + cast(1 as bigint)) AS (x + 1)#225L]
   !   :  +- Project [(id#221L + scalar-subquery#218 []) AS x#219L]      :  :  +- Aggregate [max(id#222L) AS max(id)#224L]
   !   :     :  +- Aggregate [max(id#222L) AS max(id)#224L]              :  :     +- Range (0, 2, step=1, splits=None)
   !   :     :     +- Range (0, 2, step=1, splits=None)                  :  +- Range (0, 1, step=1, splits=None)
   !   :     +- Range (0, 1, step=1, splits=None)                        +- Project [cast(1 as bigint) AS a#226L]
   !   +- Project [cast(a#220 as bigint) AS a#226L]                         +- OneRowRelation
   !      +- Project [1 AS a#220]                                     
   !         +- OneRowRelation    
   ```





[GitHub] [spark] wangyum commented on pull request #36595: [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36595:
URL: https://github.com/apache/spark/pull/36595#issuecomment-1131206625

   Fixed.

