Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/02 16:10:13 UTC

[GitHub] [spark] peter-toth opened a new pull request #35382: [SPARK-28090][SQL] Improve replaceAliasButKeepName performance

peter-toth opened a new pull request #35382:
URL: https://github.com/apache/spark/pull/35382


   ### What changes were proposed in this pull request?
   
   The SPARK-28090 ticket description contains an example query with multiple levels of nested struct creation and field extraction. The following is the plan of that query when the range in the sample code is set to only 3:
   ```
   Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42]
   +- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
      +- LogicalRDD [numerics#5], false
   ```
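   The ticket's sample code is not reproduced here. Purely as an illustration, a hypothetical Scala sketch of a query with the same shape could look like the following (the object name, columns and values are made up; only the nesting pattern of struct creation and field extraction matters):
   ```scala
   // Hypothetical reconstruction of the *shape* of the ticket's query
   // (the exact code lives in the SPARK-28090 ticket description): every level
   // re-creates the struct from the previous struct's fields and adds one
   // derived field, so the plan contains deeply nested struct creation and
   // field extraction like the one shown above.
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._

   object Spark28090Shape {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").getOrCreate()
       import spark.implicits._

       var df = Seq((1, 2)).toDF("num1", "num2")
         .select(struct($"num1", $"num2").as("numerics"))

       // the ticket's "range"; values of 7+ made plan generation take minutes to hours
       val range = 3
       (1 to range).foreach { i =>
         df = df.select(
           struct($"numerics.*", (-$"numerics.num1").as(s"out_num$i")).as("numerics"))
       }

       df.explain(extended = true)
       spark.stop()
     }
   }
   ```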
   If the level of nesting reaches 7, query plan generation becomes extremely slow on Spark 2.4. That is because both
   - the `CollapseProject` rule is slow, and
   - some of the expression optimization rules that run on the huge, not yet simplified expression tree of the single, collapsed `Project` node are slow.
   
   On Spark 3.3, after SPARK-36718, `CollapseProject` no longer collapses such plans, so the issues above don't occur. However, the `PhysicalOperation` extractor still builds up that huge expression tree and then traverses and modifies it in `AliasHelper.replaceAliasButKeepName()`. With a small change to that function we can avoid these costly operations.
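   For illustration only, here is a rough sketch of the kind of short-circuit this enables. It is not the actual diff; `ReplaceAliasSketch` is a hypothetical object that only assumes Catalyst's `Alias`, `Attribute`, `AttributeMap` and `NamedExpression` types. The idea is to resolve the common top-level `Attribute` case with a single map lookup, keeping the attribute's name, instead of building up and traversing the whole, potentially huge, expression tree:
   ```scala
   // Hypothetical sketch (not the actual change in AliasHelper): short-circuit the
   // common case where the projected expression is just a top-level attribute, so
   // the huge nested struct expression behind the alias is inlined with a single
   // lookup instead of being rebuilt by a full tree transformation.
   import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, NamedExpression}

   object ReplaceAliasSketch {
     def replaceAliasButKeepName(
         expr: NamedExpression,
         aliasMap: AttributeMap[Alias]): NamedExpression = expr match {
       case a: Attribute =>
         // Single lookup: inline the aliased child expression but keep the
         // original attribute's name so the result is still a NamedExpression.
         aliasMap.get(a).map(alias => Alias(alias.child, a.name)()).getOrElse(a)
       case other =>
         // Otherwise substitute attribute references inside the expression.
         other.mapChildren(_.transformUp {
           case a: Attribute => aliasMap.get(a).map(_.child).getOrElse(a)
         }).asInstanceOf[NamedExpression]
     }
   }
   ```
   The real change may handle details such as `ExprId`s and metadata differently; the sketch only captures why the nested tree no longer needs to be rebuilt for the common case.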
   
   ### Why are the changes needed?
   The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds.
   
   ### Does this PR introduce _any_ user-facing change?
   The example query can now be executed in a reasonable time.
   
   ### How was this patch tested?
   Existing UTs + manual test of the example query in the ticket description.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #35382: [SPARK-28090][SQL] Improve replaceAliasButKeepName performance

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #35382:
URL: https://github.com/apache/spark/pull/35382#issuecomment-1084734783


   I think there is a build error currently on `master`:
   ```
   [error] [error] /__w/spark/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala:426:22: not enough arguments for constructor TaskContextImpl: (stageId: Int, stageAttemptNumber: Int, partitionId: Int, taskAttemptId: Long, attemptNumber: Int, numPartitions: Int, taskMemoryManager: org.apache.spark.memory.TaskMemoryManager, localProperties: java.util.Properties, metricsSystem: org.apache.spark.metrics.MetricsSystem, taskMetrics: org.apache.spark.executor.TaskMetrics, cpus: Int, resources: Map[String,org.apache.spark.resource.ResourceInformation])org.apache.spark.TaskContextImpl.
   [error] Unspecified value parameter metricsSystem.
   [error]       val context1 = new TaskContextImpl(0, 0, 0, 0, 0, null, null, null, cpus = 0)
   [error]                      ^
   [error] one error found
   ```
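   Going by the signature printed in the error, the constructor takes a `numPartitions: Int` before `taskMemoryManager`, so the positional arguments in the test shift by one and `metricsSystem` is left unspecified. A hypothetical sketch of the kind of fix for that line in `SubexpressionEliminationSuite` (derived only from the signature above, not necessarily what landed on `master`; named parameters used for clarity):
   ```scala
   // Hypothetical sketch: supply the newly required numPartitions argument so the
   // trailing nulls line up with taskMemoryManager, localProperties and
   // metricsSystem again. The value 1 is arbitrary for the sketch.
   val context1 = new TaskContextImpl(
     stageId = 0,
     stageAttemptNumber = 0,
     partitionId = 0,
     taskAttemptId = 0,
     attemptNumber = 0,
     numPartitions = 1,
     taskMemoryManager = null,
     localProperties = null,
     metricsSystem = null,
     cpus = 0)
   ```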


[GitHub] [spark] dongjoon-hyun closed pull request #35382: [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #35382:
URL: https://github.com/apache/spark/pull/35382


   


[GitHub] [spark] peter-toth commented on pull request #35382: [SPARK-28090][SQL] Improve replaceAliasButKeepName performance

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #35382:
URL: https://github.com/apache/spark/pull/35382#issuecomment-1085518180


   > @peter-toth it should be fixed now, can you rebase this PR? thanks!
   
   Thanks. Done.


[GitHub] [spark] cloud-fan commented on pull request #35382: [SPARK-28090][SQL] Improve replaceAliasButKeepName performance

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35382:
URL: https://github.com/apache/spark/pull/35382#issuecomment-1085406556


   @peter-toth it should be fixed now, can you rebase this PR? thanks!


[GitHub] [spark] peter-toth commented on pull request #35382: [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #35382:
URL: https://github.com/apache/spark/pull/35382#issuecomment-1086800126


   Thanks @cloud-fan, @dongjoon-hyun for the review!

