Posted to commits@spark.apache.org by ya...@apache.org on 2020/03/19 11:54:58 UTC
[spark] branch branch-3.0 updated: [SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId

This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new a8c08b1  [SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId
a8c08b1 is described below

commit a8c08b1d81aefd1e3d7f4616b76e2285f9981cc7
Author: Kris Mok <kr...@databricks.com>
AuthorDate: Thu Mar 19 20:53:01 2020 +0900

    [SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId
    
    ### What changes were proposed in this pull request?
    
    Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. The generated code can be obtained through `df.queryExecution.debug.codegen` or through the SQL `EXPLAIN CODEGEN` statement.
    
    The generated code is currently printed in no particular order, which makes debugging harder than it needs to be. This PR makes a minor improvement: it sorts the codegen dump by `codegenStageId` in ascending order.
    
    After this change, the following query:
    ```scala
    spark.range(10).agg(sum('id)).queryExecution.debug.codegen
    ```
    will always dump the generated code in a natural, stable order. A version of this example with shorter output is:
    ```
    spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
    *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
    +- *(1) Range (0, 10, step=1, splits=16)
    
    *(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
    +- Exchange SinglePartition, true, [id=#30]
       +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
          +- *(1) Range (0, 10, step=1, splits=16)
    ```
    
    The number of codegen stages within a single SQL query tends to be very small, most likely fewer than 50, so the overhead of the added sort shouldn't be significant.
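    
    The effect of the added `sortBy` can be sketched without a Spark session. In the sketch below, `StageSketch` and `SortByStageIdSketch` are hypothetical stand-ins invented for illustration, not Spark classes, and the input is assumed to be a collection without a guaranteed iteration order (modeled here with a `Set`):
    
    ```scala
    // Hypothetical stand-in for a whole-stage codegen subtree; not Spark's actual class.
    final case class StageSketch(codegenStageId: Int, description: String)
    
    object SortByStageIdSketch {
      // Same shape as the patched line: convert to a Seq, then sort by stage ID (ascending).
      def sortStages(stages: Iterable[StageSketch]): Seq[StageSketch] =
        stages.toSeq.sortBy(_.codegenStageId)
    
      def main(args: Array[String]): Unit = {
        // A Set is used as an assumed model of an unordered collection of subtrees.
        val subtrees: Set[StageSketch] = Set(
          StageSketch(2, "HashAggregate (final)"),
          StageSketch(1, "HashAggregate (partial) + Range"))
    
        // Print the stages in ascending codegenStageId order.
        sortStages(subtrees).foreach { s =>
          println(s"*(${s.codegenStageId}) ${s.description}")
        }
      }
    }
    ```
    
    `sortBy` is a stable sort and runs in O(n log n), so for a few dozen stages the cost is small compared with generating and compiling the code itself.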
    
    ### Why are the changes needed?
    
    Minor improvement to aid WSCG debugging.
    
    ### Does this PR introduce any user-facing change?
    
    No user-facing change for end-users; minor change for developers who debug WSCG generated code.
    
    ### How was this patch tested?
    
    Manually tested the output; all other tests still pass.
    
    Closes #27955 from rednaxelafx/codegen.
    
    Authored-by: Kris Mok <kr...@databricks.com>
    Signed-off-by: Takeshi Yamamuro <ya...@apache.org>
    (cherry picked from commit a1776288f48d450fea28f50fef78fd6aa10a8160)
    Signed-off-by: Takeshi Yamamuro <ya...@apache.org>
---
 .../src/main/scala/org/apache/spark/sql/execution/debug/package.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala
index 6a57ef2..6c40104 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala
@@ -113,7 +113,7 @@ package object debug {
         s
       case s => s
     }
-    codegenSubtrees.toSeq.map { subtree =>
+    codegenSubtrees.toSeq.sortBy(_.codegenStageId).map { subtree =>
       val (_, source) = subtree.doCodeGen()
       val codeStats = try {
         CodeGenerator.compile(source)._2

