You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/08 10:21:03 UTC

[GitHub] [spark] pan3793 opened a new pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

pan3793 opened a new pull request #35438:
URL: https://github.com/apache/spark/pull/35438


   ### What changes were proposed in this pull request?
   
   This PR propose to materialize `QueryPlan#subqueries` and pruning by `PLAN_EXPRESSION` on search to improve the performance.
   
   ### Why are the changes needed?
   
   We found a query in production that cost lots of time in optimize phase (also include AQE optimize phase) when enable DPP, the SQL pattern likes
   
   ```
   select <cols...>
   from a
   left join b on a.<col> = b.<col>
   left join c on b.<col> = c.<col>
   left join d on c.<col> = d.<col>
   left join e on d.<col> = e.<col>
   left join f on e.<col> = f.<col>
   left join g on f.<col> = g.<col>
   left join h on g.<col> = h.<col>
   ...
   ```
   SPARK-36444 significantly reduces the optimize time (exclude AQE phase), see detail at #35431, but there are still lots of time costs in `PlanDynamicPruningFilters` on AQE optimize phase.
   
   Before this change, the query costs 658s, after this change only costs 65s.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing UTs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #35438:
URL: https://github.com/apache/spark/pull/35438


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35438:
URL: https://github.com/apache/spark/pull/35438#issuecomment-1033381938


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on a change in pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #35438:
URL: https://github.com/apache/spark/pull/35438#discussion_r801532365



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##########
@@ -427,8 +427,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
   /**
    * All the top-level subqueries of the current plan node. Nested subqueries are not included.
    */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Review comment:
       Thanks for tips, updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on a change in pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on a change in pull request #35438:
URL: https://github.com/apache/spark/pull/35438#discussion_r801504522



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##########
@@ -427,8 +427,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
   /**
    * All the top-level subqueries of the current plan node. Nested subqueries are not included.
    */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Review comment:
       please add `@transient`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on a change in pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #35438:
URL: https://github.com/apache/spark/pull/35438#discussion_r802260134



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##########
@@ -427,8 +427,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
   /**
    * All the top-level subqueries of the current plan node. Nested subqueries are not included.
    */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Review comment:
       `SparkPlan` is the subclass of `QueryPlan`,  which need to be sent to executor, use `@transient` to reduce the memory usage of executor.
   ```
   abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
   ``` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
pan3793 commented on pull request #35438:
URL: https://github.com/apache/spark/pull/35438#issuecomment-1032445401


   cc @wangyum @cloud-fan @yaooqinn 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
amaliujia commented on pull request #35438:
URL: https://github.com/apache/spark/pull/35438#issuecomment-1033313752


   cc @amaliujia


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35438:
URL: https://github.com/apache/spark/pull/35438#issuecomment-1032449306


   cc @maryannxue @allisonwang-db @sigmod FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
pan3793 commented on pull request #35438:
URL: https://github.com/apache/spark/pull/35438#issuecomment-1041043654


   @cloud-fan would you please take a look? thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a change in pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
amaliujia commented on a change in pull request #35438:
URL: https://github.com/apache/spark/pull/35438#discussion_r802247279



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##########
@@ -427,8 +427,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
   /**
    * All the top-level subqueries of the current plan node. Nested subqueries are not included.
    */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Review comment:
       Just for education purpose: why `@transient` is useful here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on a change in pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #35438:
URL: https://github.com/apache/spark/pull/35438#discussion_r802260134



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##########
@@ -427,8 +427,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
   /**
    * All the top-level subqueries of the current plan node. Nested subqueries are not included.
    */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Review comment:
       The `QueryPlan` e.g. `Expression` need to be sent to executor, use `@transient` to reduce the memory usage of executor.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on a change in pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #35438:
URL: https://github.com/apache/spark/pull/35438#discussion_r802260134



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##########
@@ -427,8 +427,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
   /**
    * All the top-level subqueries of the current plan node. Nested subqueries are not included.
    */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {

Review comment:
       ~~The `QueryPlan` e.g. `Expression` need to be sent to executor, use `@transient` to reduce the memory usage of executor.~~




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35438: [SPARK-38138][SQL] Materialize QueryPlan subqueries

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35438:
URL: https://github.com/apache/spark/pull/35438#issuecomment-1043979025


   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org