Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/07 04:03:14 UTC

[GitHub] [spark] wangyum opened a new pull request, #36477: [SPARK-39120][SQL] Prunes the duplicate columns from child of UnaryNode

wangyum opened a new pull request, #36477:
URL: https://github.com/apache/spark/pull/36477

   ### What changes were proposed in this pull request?
   
   Prune duplicate columns from the child of a `UnaryNode`. For example:
   ```scala
   sql("create table t1(a int, b int, c int) using parquet")
   sql("select a, max(b) from (select a, b, a, b, a from t1) group by a").explain(true)
   ```
   Before this PR:
   ```
   == Optimized Logical Plan ==
   Aggregate [a#0], [a#0, max(b#1) AS max(b)#4]
   +- Project [a#0, b#1, a#0, b#1, a#0]
      +- Relation default.t1[a#0,b#1,c#2] parquet
   ```
   
   After this PR:
   ```
   == Optimized Logical Plan ==
   Aggregate [a#0], [a#0, max(b#1) AS max(b)#4]
   +- Project [a#0, b#1]
      +- Relation default.t1[a#0,b#1,c#2] parquet
   ```
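
   A rough sketch of the pruning idea on a toy model, not Catalyst code. `Attr`, `DedupSketch`, `hasDuplicates`, and `dedupByExprId` are hypothetical names introduced only for illustration, and the sketch assumes duplicates are identified by expression ID, in the spirit of the `output.size > outputSet.size` guard in this PR's diff:

   ```scala
   // Toy stand-in for a Catalyst attribute: a column name plus an expression ID.
   // This is an illustrative model, not Spark's AttributeReference.
   final case class Attr(name: String, exprId: Long) {
     override def toString: String = s"$name#$exprId"
   }

   object DedupSketch {
     // Loosely mirrors the guard `output.size > outputSet.size`:
     // true when at least one expression ID appears more than once.
     def hasDuplicates(output: Seq[Attr]): Boolean =
       output.size > output.map(_.exprId).toSet.size

     // Keep the first occurrence of each expression ID, preserving column order.
     def dedupByExprId(output: Seq[Attr]): Seq[Attr] = {
       val seen = scala.collection.mutable.HashSet.empty[Long]
       // HashSet.add returns true only when the ID was not already present.
       output.filter(attr => seen.add(attr.exprId))
     }

     def main(args: Array[String]): Unit = {
       val a = Attr("a", 0L)
       val b = Attr("b", 1L)
       val projectList = Seq(a, b, a, b, a) // select a, b, a, b, a

       println(hasDuplicates(projectList))                          // prints "true"
       println(dedupByExprId(projectList).mkString("[", ", ", "]")) // prints "[a#0, b#1]"
     }
   }
   ```

   The sketch only covers the list manipulation; whether the parent node tolerates a narrower child is a separate question that the toy model does not address.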
   
   ### Why are the changes needed?
   
   To improve query performance: the child `Project` no longer emits duplicate columns.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #36477: [SPARK-39120][SQL] Prunes the duplicate columns from child of UnaryNode

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on code in PR #36477:
URL: https://github.com/apache/spark/pull/36477#discussion_r867590079


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -855,6 +855,15 @@ object ColumnPruning extends Rule[LogicalPlan] {
     case e @ Expand(_, _, child) if !child.outputSet.subsetOf(e.references) =>
       e.copy(child = prunedChild(child, e.references))
 
+    // Prunes the duplicate columns from child of UnaryNode, e.g.: Aggregate
+    case p: UnaryNode
+      if p.child.isInstanceOf[Project] && p.child.output.size > p.child.outputSet.size =>

Review Comment:
   This rule should exclude script transform.
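
   A toy illustration of why a positional consumer would be affected, assuming the script receives every child output column in order, joined by tabs. `ScriptInputSketch`, `project`, and `toScriptInput` are hypothetical helpers written for this sketch, not Spark APIs:

   ```scala
   object ScriptInputSketch {
     // Build a row by picking source columns at the given positions.
     def project(source: Seq[Any], positions: Seq[Int]): Seq[Any] =
       positions.map(i => source(i))

     // Hypothetical stand-in for feeding a child row to an external script:
     // every child output column, in order, joined with tabs.
     def toScriptInput(row: Seq[Any]): String = row.mkString("\t")

     def main(args: Array[String]): Unit = {
       val source: Seq[Any] = Seq(1, 2) // a = 1, b = 2

       val original = project(source, Seq(0, 1, 0, 1, 0)) // a, b, a, b, a
       val pruned   = project(source, Seq(0, 1))          // a, b after deduplication

       // The script would see a different number of fields per input line.
       println(toScriptInput(original).split("\t").length) // prints "5"
       println(toScriptInput(pruned).split("\t").length)   // prints "2"
     }
   }
   ```

   Under that assumption, deduplicating the child changes what the script reads, so such a node would need to be left out of the rule.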




[GitHub] [spark] wangyum commented on a diff in pull request #36477: [SPARK-39120][SQL] Prunes the duplicate columns from child of UnaryNode

Posted by GitBox <gi...@apache.org>.

wangyum commented on code in PR #36477:
URL: https://github.com/apache/spark/pull/36477#discussion_r868870921


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -855,6 +855,15 @@ object ColumnPruning extends Rule[LogicalPlan] {
     case e @ Expand(_, _, child) if !child.outputSet.subsetOf(e.references) =>
       e.copy(child = prunedChild(child, e.references))
 
+    // Prunes the duplicate columns from child of UnaryNode, e.g.: Aggregate

Review Comment:
   OK, I will close this PR.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36477: [SPARK-39120][SQL] Prunes the duplicate columns from child of UnaryNode

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36477:
URL: https://github.com/apache/spark/pull/36477#discussion_r867603924


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -855,6 +855,15 @@ object ColumnPruning extends Rule[LogicalPlan] {
     case e @ Expand(_, _, child) if !child.outputSet.subsetOf(e.references) =>
       e.copy(child = prunedChild(child, e.references))
 
+    // Prunes the duplicate columns from child of UnaryNode, e.g.: Aggregate

Review Comment:
   Hm, I don't think this works for now. Spark currently allows duplicate column names in some Dataset API cases too.




[GitHub] [spark] wangyum closed pull request #36477: [SPARK-39120][SQL] Prunes the duplicate columns from child of UnaryNode

Posted by GitBox <gi...@apache.org>.

wangyum closed pull request #36477: [SPARK-39120][SQL] Prunes the duplicate columns from child of UnaryNode
URL: https://github.com/apache/spark/pull/36477

