You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2024/01/17 09:04:47 UTC

[PR] [SPARK-46741][SQL] Cache Table with CET won't work [spark]

AngersZhuuuu opened a new pull request, #44767:
URL: https://github.com/apache/spark/pull/44767

   ### What changes were proposed in this pull request?
   Cache Table with CTE won't work, there are two reasons
     1. In the current code CTE in CacheTableAsSelect will be inlined
     2. CTERelation Ref and Def didn't handle the CTEId doCanonicalize issue
   
   Cause the current case can't be matched.
   
   
   ### Why are the changes needed?
   Fix bug
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yea, Cache table with CTE can work after this pr
   
   
   ### How was this patch tested?
   Added UT
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1464219993


##########
sql/core/src/test/resources/sql-tests/inputs/cache.sql:
##########
@@ -0,0 +1,33 @@
+CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES (0, 0), (1, 1), (2, 2) AS t(c1, c2);
+CREATE TEMPORARY VIEW t2 AS
+WITH v as (
+  SELECT c1 + c1 c3 FROM t1
+)
+SELECT SUM(c3) s FROM v;
+
+CACHE TABLE cache_table
+WITH
+t2 AS (SELECT 1)
+SELECT * FROM t2;
+
+SELECT * FROM cache_table;
+
+EXPLAIN EXTENDED SELECT * FROM cache_table;
+
+-- Nested WithCTE

Review Comment:
   > hmm, why does it have nested CTE?
   
   If there is CTE in view, and query with CTE, will have two WithCTE
   
   <img width="1003" alt="截屏2024-01-24 10 21 50" src="https://github.com/apache/spark/assets/46485123/0b5c60ea-0f7c-408c-bd42-d6107afb9f32">
   
   
   Current code with handle each WithCTE by starting cteId from 0.
   Since the origin cteId is non-repeating,  I think the current code is safe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1458253319


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(

Review Comment:
   does it access subquery expressions?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(

Review Comment:
   does it access subquery expressions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1458253653


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(
+        _.containsAnyPattern(CTE, PLAN_EXPRESSION)) {
+        case ref: CTERelationRef =>
+          ref.copy(cteId = defIndex(ref.cteId).toLong)
+

Review Comment:
   can we make sure there is no nested `WithCTE`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CET won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #44767:
URL: https://github.com/apache/spark/pull/44767#issuecomment-1895377798

   ping @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1463229851


##########
sql/core/src/test/resources/sql-tests/inputs/cache.sql:
##########
@@ -0,0 +1,33 @@
+CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES (0, 0), (1, 1), (2, 2) AS t(c1, c2);
+CREATE TEMPORARY VIEW t2 AS
+WITH v as (
+  SELECT c1 + c1 c3 FROM t1
+)
+SELECT SUM(c3) s FROM v;
+
+CACHE TABLE cache_table
+WITH
+t2 AS (SELECT 1)
+SELECT * FROM t2;
+
+SELECT * FROM cache_table;
+
+EXPLAIN EXTENDED SELECT * FROM cache_table;
+
+-- Nested WithCTE

Review Comment:
   there is only one WITH in the query



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1458257373


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(
+        _.containsAnyPattern(CTE, PLAN_EXPRESSION)) {
+        case ref: CTERelationRef =>
+          ref.copy(cteId = defIndex(ref.cteId).toLong)
+

Review Comment:
   shall we add assert to guarantee this assertion?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(
+        _.containsAnyPattern(CTE, PLAN_EXPRESSION)) {
+        case ref: CTERelationRef =>
+          ref.copy(cteId = defIndex(ref.cteId).toLong)
+

Review Comment:
   shall we add assert to guarantee this assumption?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #44767:
URL: https://github.com/apache/spark/pull/44767#issuecomment-2100528131

   ping @cloud-fan Changed follow your PR, pls take a look again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #44767:
URL: https://github.com/apache/spark/pull/44767#issuecomment-1897673599

   How about current? @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1463025288


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(
+        _.containsAnyPattern(CTE, PLAN_EXPRESSION)) {
+        case ref: CTERelationRef =>
+          ref.copy(cteId = defIndex(ref.cteId).toLong)
+

Review Comment:
   @cloud-fan Found nested WithCTE case, added to `cte.sql` and here change the method to support nested CTE, pls take a look again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1463229360


##########
sql/core/src/test/resources/sql-tests/inputs/cache.sql:
##########
@@ -0,0 +1,33 @@
+CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES (0, 0), (1, 1), (2, 2) AS t(c1, c2);
+CREATE TEMPORARY VIEW t2 AS
+WITH v as (
+  SELECT c1 + c1 c3 FROM t1
+)
+SELECT SUM(c3) s FROM v;
+
+CACHE TABLE cache_table
+WITH
+t2 AS (SELECT 1)
+SELECT * FROM t2;
+
+SELECT * FROM cache_table;
+
+EXPLAIN EXTENDED SELECT * FROM cache_table;
+
+-- Nested WithCTE

Review Comment:
   hmm, why does it have nested CTE?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1458255607


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(
+        _.containsAnyPattern(CTE, PLAN_EXPRESSION)) {
+        case ref: CTERelationRef =>
+          ref.copy(cteId = defIndex(ref.cteId).toLong)
+

Review Comment:
   > can we make sure there is no nested `WithCTE`?
   
   Form the CTESubtitution's code, and it's pr, there won't have nested `WithCTE`.  https://github.com/apache/spark/pull/37751 cc @maryannxue 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1458506743


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -910,6 +910,30 @@ case class WithCTE(plan: LogicalPlan, cteDefs: Seq[CTERelationDef]) extends Logi
   def withNewPlan(newPlan: LogicalPlan): WithCTE = {
     withNewChildren(children.init :+ newPlan).asInstanceOf[WithCTE]
   }
+
+  override def doCanonicalize(): LogicalPlan = {
+    val canonicalized = super.doCanonicalize().asInstanceOf[WithCTE]
+    val defIndex = canonicalized.cteDefs.map(_.id).zipWithIndex.toMap
+
+    def canonicalizeCTE(plan: LogicalPlan): LogicalPlan = {
+      plan.resolveOperatorsUpWithPruning(
+        _.containsAnyPattern(CTE, PLAN_EXPRESSION)) {
+        case ref: CTERelationRef =>
+          ref.copy(cteId = defIndex(ref.cteId).toLong)
+

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #44767:
URL: https://github.com/apache/spark/pull/44767#discussion_r1464220112


##########
sql/core/src/test/resources/sql-tests/inputs/cache.sql:
##########
@@ -0,0 +1,33 @@
+CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES (0, 0), (1, 1), (2, 2) AS t(c1, c2);
+CREATE TEMPORARY VIEW t2 AS
+WITH v as (
+  SELECT c1 + c1 c3 FROM t1
+)
+SELECT SUM(c3) s FROM v;
+
+CACHE TABLE cache_table
+WITH
+t2 AS (SELECT 1)
+SELECT * FROM t2;
+
+SELECT * FROM cache_table;
+
+EXPLAIN EXTENDED SELECT * FROM cache_table;
+
+-- Nested WithCTE

Review Comment:
   GA failed by unrelated issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org