Posted to reviews@spark.apache.org by DaveDeCaprio <gi...@git.apache.org> on 2018/11/18 06:38:04 UTC

[GitHub] spark pull request #23076: [SPARK-26103][SQL] Added maxDepth to limit the le...

GitHub user DaveDeCaprio opened a pull request:

    https://github.com/apache/spark/pull/23076

    [SPARK-26103][SQL] Added maxDepth to limit the length of text plans

    Nested query plans can get extremely large (hundreds of megabytes).
    
    ## What changes were proposed in this pull request?
    
    The PR puts a limit on the nesting depth of trees printed when writing a plan string (a usage sketch follows the list below).
    * The default limit is 15, which allows for reasonably nested plans.
    * A new configuration parameter called spark.debug.maxToStringTreeDepth was added to control the depth.
    * When a plan is truncated, "..." is printed to indicate that tree elements were removed.
    * A warning is printed the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit.
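    
    A minimal usage sketch (the conf name comes from this PR and is not part of any released Spark version):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Cap plan-string rendering at 10 levels instead of the default 15.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.debug.maxToStringTreeDepth", "10")
      .getOrCreate()
    ```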
    
    ## How was this patch tested?
    
    A new unit test in QueryExecutionSuite creates a highly nested plan and then ensures that the printed plan is not too long; a sketch of its shape follows.
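    
    A sketch of the kind of test described (assumed code, not the PR's exact test; `spark` is the suite's test session):
    
    ```scala
    test("toString is bounded for deeply nested plans") {
      // Build a query whose analyzed plan is ~200 levels deep.
      val df = (1 to 200).foldLeft(spark.range(1).toDF("id")) { (d, i) =>
        d.filter(s"id >= $i")
      }
      // With depth truncation, the rendered plan stays small instead of
      // growing quadratically with the nesting depth.
      assert(df.queryExecution.toString.length < 100 * 1000)
    }
    ```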
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DaveDeCaprio/spark max-log-tree-depth

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23076.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23076
    
----
commit adc4f8efd4b51b77d3600bcba8f331e92f7ea3c6
Author: Dave DeCaprio <da...@...>
Date:   2018-11-18T06:29:16Z

    Added maxDepth to treeString which limits the depth of the printed string.

commit 3a9743fbc89358055c37cc45437f191fc5f15957
Author: Dave DeCaprio <da...@...>
Date:   2018-11-18T06:34:42Z

    Added maxDepth to treeString which limits the depth of the printed string.

----


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    The goal of this is to avoid creating massive strings when toString is called on a QueryExecution.  In our use case we have some large plans that generate plan strings in the hundreds of megabytes.  These plans execute efficiently because of caching and partitioning, but we are running into out-of-memory errors because of QueryExecution.toString.
    
    In our case, writing to a file would help the memory issue, but we'd still be writing 200MB+ files for each query.  (Note these queries take seconds to run because 99% of the work is cached, and we run a bunch of them.)


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    This contribution is my original work and I license the work to the project under the project’s open source license.


---



[GitHub] spark pull request #23076: [SPARK-26103][SQL] Added maxDepth to limit the le...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23076#discussion_r234444462
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
    @@ -454,8 +455,9 @@ case class InputAdapter(child: SparkPlan) extends UnaryExecNode with CodegenSupp
           writer: Writer,
           verbose: Boolean,
           prefix: String = "",
    -      addSuffix: Boolean = false): Unit = {
    -    child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false)
    +      addSuffix: Boolean = false,
    +      maxDepth: Int = TreeNode.maxTreeToStringDepth): Unit = {
    +    child.generateTreeString(depth, lastChildren, writer, verbose, prefix, addSuffix, maxDepth)
    --- End diff --
    
    Prefix and addSuffix are ignored in the original code, but I can't tell why.  It seems like they should be passed through from parent to child here.  I adjusted this, but it might not be a bug, or the fix may belong in a separate PR.
    ```suggestion
        child.generateTreeString(depth, lastChildren, writer, verbose, prefix = "", addSuffix = false, maxDepth)
    ```


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    I originally wrote this against 2.3.0, where toString was always called as part of newExecutionId.  I'm not sure whether that still happens in master.


---



[GitHub] spark pull request #23076: [SPARK-26103][SQL] Added maxDepth to limit the le...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23076#discussion_r234444270
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
    @@ -75,7 +78,7 @@ object CurrentOrigin {
     }
     
     // scalastyle:off
    -abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
    +abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Logging {
    --- End diff --
    
    I'm not sure TreeNode should have the baggage of its own logger.  The catalyst.trees package object extends Logging, and there is a comment there saying tree nodes should just use that logger, but I don't see a way to do that since loggers are protected.
    
    I could add a log function to the package object and then use that from TreeNode:
    
        private[trees] def logWarning 
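    
    For illustration, that forwarding could look roughly like this (a sketch only, layered on the existing package object):
    
    ```scala
    package org.apache.spark.sql.catalyst
    
    import org.apache.spark.internal.Logging
    
    package object trees extends Logging {
      // Lets TreeNode reuse the package-level logger without mixing
      // Logging into every tree node instance.
      private[trees] def logPlanWarning(msg: => String): Unit = logWarning(msg)
    }
    ```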


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    @DaveDeCaprio What is the practical use case for limiting the plans? My initial goal in PRs #22429, #23018 and #23039 was to dump whole plans and gather more info for debugging. Your PR seems to have the opposite purpose.


---



[GitHub] spark pull request #23076: [SPARK-26103][SQL] Added maxDepth to limit the le...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23076#discussion_r234444348
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
    @@ -701,3 +725,23 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
         case _ => false
       }
     }
    +
    +object TreeNode {
    +  /**
    +    * Printed query plans for deeply nested queries can get extremely large. To limit the impact,
    +    * we add a parameter that restricts printing to the top layers when the tree gets too deep.
    +    * This can be overridden by setting the 'spark.debug.maxToStringTreeDepth' conf in SparkEnv.
    +    */
    +  val DEFAULT_MAX_TO_STRING_TREE_DEPTH = 15
    +
    +  def maxTreeToStringDepth: Int = {
    +    if (SparkEnv.get != null) {
    +      SparkEnv.get.conf.getInt("spark.debug.maxToStringTreeDepth", DEFAULT_MAX_TO_STRING_TREE_DEPTH)
    +    } else {
    +      DEFAULT_MAX_TO_STRING_TREE_DEPTH
    +    }
    +  }
    +
    +  /** Whether we have warned about plan string truncation yet. */
    +  private val treeDepthWarningPrinted = new AtomicBoolean(false)
    +}
    --- End diff --
    
    I wasn't sure where to put this code, so I made a TreeNode companion object.  If there is a better place, let me know.


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    Limiting on pure size makes sense.  I'd be willing to implement that in a different PR.  Now that we are using a writer it should be pretty easy to do; that was harder in the Spark 2.4 branch, which is why I took the approach I did.
    
    A few questions about that:
    
    *	I assume we still want a config parameter for the maximum size.  Any suggestions on the name: spark.debug.maxPlanStringSizeInMB?
    *	Is there a way to represent byte sizes in config?  (See the sketch below.)
    *	Should the limit be on each element of the plan or the overall plan: Logical Plan, Physical Plan, Optimized Logical Plan, etc.?  I kind of like the idea of putting it on the treeString method itself, since then it gets applied wherever that method is called rather than only in queryExecution.toString.
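    
    For the second question, a sketch of what SparkConf can already parse (the key name here is only a suggestion, not an existing conf):
    
    ```scala
    // getSizeAsBytes understands suffixes like "200m" or "1g".
    val maxBytes: Long = spark.sparkContext.getConf
      .getSizeAsBytes("spark.sql.debug.maxPlanStringSize", "100m")
    ```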
    
     
    
    Dave
    
     
    



---



[GitHub] spark pull request #23076: [SPARK-26103][SQL] Added maxDepth to limit the le...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio closed the pull request at:

    https://github.com/apache/spark/pull/23076


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    I'm seeing both sides of the need: while I think dumping the full plan to a file is a good feature for debugging specific issues, retaining full plans for rendering on the UI page has been a headache, and three related issues ([SPARK-23904](https://issues.apache.org/jira/browse/SPARK-23904), [SPARK-25380](https://issues.apache.org/jira/browse/SPARK-25380), [SPARK-26103](https://issues.apache.org/jira/browse/SPARK-26103)) have been filed in three months, which doesn't look like something where we can tell end users to take a workaround.
    
    One thing we should be aware of is that a huge plan is generated not only by deeply nested joins but also by plans with lots of columns, as in SPARK-23904. For SPARK-25380 we don't know which parts generate the huge plan. So it might be easier and more flexible to just truncate to a specific size rather than applying structural conditions; a sketch of that idea follows.
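    
    As a rough illustration of the size-based approach (an assumed design, not code from any of the linked PRs), the writer-based rendering could be capped like this:
    
    ```scala
    import java.io.Writer
    
    // Forwards at most `maxChars` characters and silently drops the rest,
    // so a runaway plan string cannot exhaust memory or disk.
    class SizeLimitedWriter(underlying: Writer, maxChars: Long) extends Writer {
      private var written = 0L
    
      override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
        val remaining = math.max(0L, maxChars - written)
        val toWrite = math.min(len.toLong, remaining).toInt
        if (toWrite > 0) underlying.write(cbuf, off, toWrite)
        written += len
      }
    
      override def flush(): Unit = underlying.flush()
      override def close(): Unit = underlying.close()
    }
    ```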


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    Closing this pull request in favor of #23169


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    In any case where we use truncatedString to limit the output columns, I think it also makes sense to limit the plan depth.  (A simplified rendition of that width cap is below.)
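    
    For reference, the width cap works roughly like this (a simplified rendition of Spark 2.4's `Utils.truncatedString`, not the exact source):
    
    ```scala
    // Keep at most `maxFields - 1` entries and summarize the rest.
    def truncated[T](seq: Seq[T], maxFields: Int): String =
      if (seq.length > maxFields) {
        val kept = math.max(0, maxFields - 1)
        seq.take(kept).mkString("[", ", ", s", ... ${seq.length - kept} more fields]")
      } else {
        seq.mkString("[", ", ", "]")
      }
    
    // truncated(Seq("a", "b", "c", "d"), 2)  // => "[a, ... 3 more fields]"
    ```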


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by DaveDeCaprio <gi...@git.apache.org>.
Github user DaveDeCaprio commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    @MaxGekk and @hvanhovell, I think this PR is complementary to the PR for SPARK-26023, which recently added the ability to dump queries to a file.
    
    It might make sense to have the default depth apply only to the toString function and not to writing to a file. We could add a parameter to toFile to control the depth, defaulting to Int.MaxValue, while toString uses the configured limit. (A sketch follows.)
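    
    A sketch of that split (hypothetical shapes; `planString` stands in for the real rendering logic, and `TreeNode.maxTreeToStringDepth` is the accessor added in this PR):
    
    ```scala
    class QueryExecutionSketch(planString: Int => String) {
      // Full depth by default when dumping to a file for offline debugging.
      def toFile(path: String, maxDepth: Int = Int.MaxValue): Unit = {
        val out = new java.io.PrintWriter(path)
        try out.write(planString(maxDepth)) finally out.close()
      }
    
      // In-memory rendering uses the configured cap (15 unless overridden).
      override def toString: String = planString(TreeNode.maxTreeToStringDepth)
    }
    ```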



---



[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23076
  
    > I assume we still want a config parameter for the maximum size. Any suggestions on the name: spark.debug.maxPlanStringSizeInMB?
    
    I would propose to make it more SQL specific and put the config in the `spark.sql.debug` namespace, especially in light of moving `truncatedString` to `sql/catalyst` (see #23039).
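    
    For illustration, such a conf could be declared next to the other SQL configs roughly like this (a sketch; the key name is only a suggestion):
    
    ```scala
    import org.apache.spark.internal.config.ConfigBuilder
    import org.apache.spark.network.util.ByteUnit
    
    val MAX_PLAN_STRING_SIZE = ConfigBuilder("spark.sql.debug.maxPlanStringSize")
      .doc("Maximum size of the string produced when rendering a query plan.")
      .bytesConf(ByteUnit.BYTE)
      .createWithDefaultString("100m")
    ```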


---
