You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "cloud-fan (via GitHub)" <gi...@apache.org> on 2023/07/17 07:40:36 UTC

[GitHub] [spark] cloud-fan opened a new pull request, #42036: [SPARK-44355][SQL] Move WithCTE into command queries

cloud-fan opened a new pull request, #42036:
URL: https://github.com/apache/spark/pull/42036

   This PR is based on https://github.com/apache/spark/pull/42022 to fix tests, as the PR author is on vacation.
   
   ### What changes were proposed in this pull request?
   In the PR, I propose to add new trait `CTEInChildren` and mix it to some commands that should have `WithCTE` on top of their children (queries) instead of main query. Also I modified  the `CTESubstitution` rule and removed special handling of `Command`s and similar nodes. After the changes, `Command`, `ParsedStatement` and `InsertIntoDir` are handled in the same way as other queries by referring to CTE Defs. Only the difference is in `WithCTE` is not not placed on the top of main query but on top of command queries.
   
   Closes #41922
   
   ### Why are the changes needed?
   To improve code maintenance. Right now the CTE resolution code path is diverged: query with commands go into CTE inline code path where non-command queries go into CTEDef code path. 
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   By running new test:
   ```
   $ build/sbt "test:testOnly *InsertSuite"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1273159181


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -887,6 +887,16 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   }
 }
 
+/**
+ * The logical node which is able to place the `WithCTE` node on its children.
+ */
+trait CTEInChildren extends LogicalPlan {
+  def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
+    withNewChildren(children.map(WithCTE(_, cteDefs)))

Review Comment:
   I'm not sure that it is always fine to duplicate CTE relations into multiple childrens.
   For example, if we have a non-deterministic relation definition and 1-1 reference to it in 2 childrens of `CTEInChildren` and then here we duplicate the relations into the 2 childrens then I think the `InlineCTE` rule will decide to inline the relation 2 times, which is not correct.
   But I agree with you, I don't see that this could happen now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1272960040


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3759,6 +3759,14 @@ object SQLConf {
     .checkValues(LegacyBehaviorPolicy.values.map(_.toString))
     .createWithDefault(LegacyBehaviorPolicy.EXCEPTION.toString)
 
+  val LEGACY_INLINE_CTE_IN_COMMANDS = buildConf("spark.sql.legacy.inlineCTEInCommands")
+    .internal()
+    .doc("If true, always inline the CTE relations for the queries in commands. This is the " +
+      "legacy behavior which may produce incorrect results because Spark may evaluate a CTE " +
+      "relation more than once, even if it's nondeterministic.")
+    .booleanConf
+    .createWithDefault(false)

Review Comment:
   good catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #42036:
URL: https://github.com/apache/spark/pull/42036#issuecomment-1649890644

   the test failure is unrelated, I'm merging it to master, thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1272960217


##########
sql/core/src/test/resources/sql-tests/inputs/cte-command.sql:
##########
@@ -0,0 +1,33 @@
+-- WITH inside CTE

Review Comment:
   no, but the analyzed plan is different, as we always inline CTE before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1272959878


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala:
##########
@@ -1234,12 +1241,16 @@ case class RepairTable(
 case class AlterViewAs(
     child: LogicalPlan,
     originalText: String,
-    query: LogicalPlan) extends BinaryCommand {
+    query: LogicalPlan) extends BinaryCommand with CTEInChildren {
   override def left: LogicalPlan = child
   override def right: LogicalPlan = query
   override protected def withNewChildrenInternal(
       newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan =
     copy(child = newLeft, query = newRight)
+
+  override def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
+    withNewChildren(Seq(child, WithCTE(query, cteDefs)))

Review Comment:
   `withNewChildren` can copy over the tree node tags.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1273159181


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -887,6 +887,16 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   }
 }
 
+/**
+ * The logical node which is able to place the `WithCTE` node on its children.
+ */
+trait CTEInChildren extends LogicalPlan {
+  def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
+    withNewChildren(children.map(WithCTE(_, cteDefs)))

Review Comment:
   I'm not sure that it is always fine to duplicate CTE relations into multiple childrens.
   For example, if we have a non-deterministic relation definition and 1-1 reference to it in 2 childrens of `CTEInChildren` and then we duplicate the relations into the 2 childrens then I think the `InlineCTE` rule decides to inline the relation 2 times, which is not correct.
   But I agree with you, I don't see that this could happen now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1270216567


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala:
##########
@@ -46,42 +46,65 @@ import org.apache.spark.sql.internal.SQLConf.{LEGACY_CTE_PRECEDENCE_POLICY, Lega
  * dependency for any valid CTE query (i.e., given CTE definitions A and B with B referencing A,
  * A is guaranteed to appear before B). Otherwise, it must be an invalid user query, and an
  * analysis exception will be thrown later by relation resolving rules.
+ *
+ * If the query is a SQL command or DML statement (extends `CTEInChildren`),
+ * place `WithCTE` into their children.
  */
 object CTESubstitution extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = {
     if (!plan.containsPattern(UNRESOLVED_WITH)) {
       return plan
     }
-    val isCommand = plan.exists {
-      case _: Command | _: ParsedStatement | _: InsertIntoDir => true
-      case _ => false
+
+    val forceInline = if (conf.getConf(SQLConf.LEGACY_INLINE_CTE_IN_COMMANDS)) {
+      // The legacy behavior always inlines the CTE relations for queries in commands.
+      plan.exists {
+        case _: Command | _: ParsedStatement | _: InsertIntoDir => true
+        case _ => false
+      }
+    } else {
+      val commands = plan.collect {
+        case c @ (_: Command | _: ParsedStatement | _: InsertIntoDir) => c
+      }
+      if (commands.length == 1) {
+        // If there is only one command and it's `CTEInChildren`, we can resolve
+        // CTE normally and don't need to force inline.
+        !commands.head.isInstanceOf[CTEInChildren]
+      } else if (commands.length > 1) {
+        // This can happen with the multi-insert statement. We should fall back to

Review Comment:
   Nit: There is duplicated logic here.
   To make the code more readable,  we can always collect the commands first. If the length of commands is 1, there is a different behavior based on the legacy conf. Otherwise the logic is determined. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1270217647


##########
sql/core/src/test/resources/sql-tests/inputs/cte-command.sql:
##########
@@ -0,0 +1,33 @@
+-- WITH inside CTE

Review Comment:
   QQ: Is there a case will fail before the code change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #42036:
URL: https://github.com/apache/spark/pull/42036#issuecomment-1644973729

   LGTM over


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1271958966


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3759,6 +3759,14 @@ object SQLConf {
     .checkValues(LegacyBehaviorPolicy.values.map(_.toString))
     .createWithDefault(LegacyBehaviorPolicy.EXCEPTION.toString)
 
+  val LEGACY_INLINE_CTE_IN_COMMANDS = buildConf("spark.sql.legacy.inlineCTEInCommands")
+    .internal()
+    .doc("If true, always inline the CTE relations for the queries in commands. This is the " +
+      "legacy behavior which may produce incorrect results because Spark may evaluate a CTE " +
+      "relation more than once, even if it's nondeterministic.")
+    .booleanConf
+    .createWithDefault(false)

Review Comment:
   Do we need version here?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala:
##########
@@ -1234,12 +1241,16 @@ case class RepairTable(
 case class AlterViewAs(
     child: LogicalPlan,
     originalText: String,
-    query: LogicalPlan) extends BinaryCommand {
+    query: LogicalPlan) extends BinaryCommand with CTEInChildren {
   override def left: LogicalPlan = child
   override def right: LogicalPlan = query
   override protected def withNewChildrenInternal(
       newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan =
     copy(child = newLeft, query = newRight)
+
+  override def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
+    withNewChildren(Seq(child, WithCTE(query, cteDefs)))

Review Comment:
   Why not just `copy(query = WithCTE(...` like at other places?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -887,6 +887,16 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   }
 }
 
+/**
+ * The logical node which is able to place the `WithCTE` node on its children.
+ */
+trait CTEInChildren extends LogicalPlan {
+  def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
+    withNewChildren(children.map(WithCTE(_, cteDefs)))

Review Comment:
   I wonder if it makes sense to assert that we have only 1 child.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1272959723


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -887,6 +887,16 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   }
 }
 
+/**
+ * The logical node which is able to place the `WithCTE` node on its children.
+ */
+trait CTEInChildren extends LogicalPlan {
+  def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
+    withNewChildren(children.map(WithCTE(_, cteDefs)))

Review Comment:
   It seems fine to have multiple children, we just duplicate the CTE relations. The current code does not allow it though, and go back to inline CTE.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1272960408


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3759,6 +3759,14 @@ object SQLConf {
     .checkValues(LegacyBehaviorPolicy.values.map(_.toString))
     .createWithDefault(LegacyBehaviorPolicy.EXCEPTION.toString)
 
+  val LEGACY_INLINE_CTE_IN_COMMANDS = buildConf("spark.sql.legacy.inlineCTEInCommands")
+    .internal()
+    .doc("If true, always inline the CTE relations for the queries in commands. This is the " +
+      "legacy behavior which may produce incorrect results because Spark may evaluate a CTE " +
+      "relation more than once, even if it's nondeterministic.")
+    .booleanConf

Review Comment:
   ```suggestion
       .version("4.0.0")
       .booleanConf
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries
URL: https://github.com/apache/spark/pull/42036


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #42036: [SPARK-44355][SQL] Move WithCTE into command queries

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #42036:
URL: https://github.com/apache/spark/pull/42036#discussion_r1270216924


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala:
##########
@@ -273,4 +297,21 @@ object CTESubstitution extends Rule[LogicalPlan] {
             e.withNewPlan(apply(substituteCTE(e.plan, alwaysInline, cteRelations)))
         }
     }
+  }
+
+  /**
+   * For commands which extend `CTEInChildren`, we should place the `WithCTE` node on its
+   * children. There are two reasons:
+   *  1. Some rules will pattern match the root command nodes, and we should keep command
+   *     as the root node to not break them.
+   *  2. `Dataset` eagerly executes the commands inside a query plan. However, the CTE

Review Comment:
   Nit: Shall we have an example for the eager execution?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org