You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ankuriitg <gi...@git.apache.org> on 2018/10/01 21:15:38 UTC

[GitHub] spark pull request #22604: [SPARK-25586][MLlib][Core] Replace toString metho...

GitHub user ankuriitg opened a pull request:

    https://github.com/apache/spark/pull/22604

    [SPARK-25586][MLlib][Core] Replace toString method with a summarize m…

    …ethod in GeneralizedLinearRegressionTrainingSummary
    
    ## What changes were proposed in this pull request?
    
    After the change in SPARK-25118, which enables spark-shell to run with default
    log level, test_glr_summary started failing with StackOverflow error.
    
    Cause: ClosureCleaner calls logDebug on various objects and when it is called
    for GeneralizedLinearRegressionTrainingSummary, it starts a spark job which runs
    into infinite loop and fails with the below exception.
    
    Fix: Remove toString method and move the existing logic to a new method
    "summarize". This follows the general guideline as other summary objects do not
    have a toString method as well.
    
    ## How was this patch tested?
    
    Ran the python "mllib" and scala/junit tests for this module

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ankuriitg/spark ankurgupta/SPARK-25586

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22604.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22604
    
----
commit 8c162a74a06d74a8a60aade4bf6e8af5906a8fb7
Author: ankurgupta <an...@...>
Date:   2018-10-01T21:05:39Z

    [SPARK-25586][MLlib][Core] Replace toString method with a summarize method in
    GeneralizedLinearRegressionTrainingSummary
    
    After the change in SPARK-25118, which enables spark-shell to run with default
    log level, test_glr_summary started failing with StackOverflow error.
    
    Cause: ClosureCleaner calls logDebug on various objects and when it is called
    for GeneralizedLinearRegressionTrainingSummary, it starts a spark job which runs
    into infinite loop and fails with the below exception.
    
    Fix: Remove toString method and move the existing logic to a new method
    "summarize". This follows the general guideline as other summary objects do not
    have a toString method as well.
    
    Testing Done:
    Ran the python "mllib" and scala/junit tests for this module

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22604: [SPARK-25586][MLlib][Core] Replace toString metho...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg closed the pull request at:

    https://github.com/apache/spark/pull/22604


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    OK, this PR should be closed then.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Jenkins, ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    I have opened another PR for this, with the recommended changes: https://github.com/apache/spark/pull/22616
    
    I had to change a few other logDebug statements as well. Please let me know if that works.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Makes sense and let me update the description about SPARK-25118


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    @vanzin @squito @bersprockets 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    This just removes `toString` entirely and adds a new method. I don't think we can do that.  What's the minimum change to `toString` that means it doesn't execute a Spark job? can the same info be preserved and presented otherwise?
    
    Alternatively, we go back to the other JIRA and figure out why it was causing an infinite loop in the first place.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96830/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    OK, backing up here, yes it's important to specify what the problem was that started this. It's exhibited in the error logs from your PR for SPARK-25118 (which is not submitted; might be worth clarifying your description).
    
    ```
    	at org.apache.spark.rdd.RDD.map(RDD.scala:371)
    	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance$lzycompute(GeneralizedLinearRegression.scala:1342)
    	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance(GeneralizedLinearRegression.scala:1315)
    	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.toString(GeneralizedLinearRegression.scala:1556)
    	at java.lang.String.valueOf(String.java:2994)
    	at java.lang.StringBuilder.append(StringBuilder.java:131)
    	at scala.StringContext.standardInterpolator(StringContext.scala:125)
    	at scala.StringContext.s(StringContext.scala:95)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
    	at org.apache.spark.internal.Logging$class.logDebug(Logging.scala:62)
    	at org.apache.spark.util.ClosureCleaner$.logDebug(ClosureCleaner.scala:35)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
    	at scala.collection.immutable.List.foreach(List.scala:392)
    	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:289)
    	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    	at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
    	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:372)
    	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
    	at org.apache.spark.rdd.RDD.map(RDD.scala:371)
    	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance$lzycompute(GeneralizedLinearRegression.scala:1342)
    	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance(GeneralizedLinearRegression.scala:1315)
    	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.toString(GeneralizedLinearRegression.scala:1556)
    	at java.lang.String.valueOf(String.java:2994)
    	at java.lang.StringBuilder.append(StringBuilder.java:131)
    	at scala.StringContext.standardInterpolator(StringContext.scala:125)
    	at scala.StringContext.s(StringContext.scala:95)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
    	at org.apache.spark.internal.Logging$class.logDebug(Logging.scala:62)
    	at org.apache.spark.util.ClosureCleaner$.logDebug(ClosureCleaner.scala:35)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
    	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
    	at scala.collection.immutable.List.foreach(List.scala:392)
    	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:289)
    	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    ```
    
    Yes, inside `toString`, computing `nullDeviance` and `deviance` runs a Spark job. The problem occurs at debug log level, because in the ClosureCleaner:
    
    ```
          if (log.isDebugEnabled) {
            ...
            logDebug(s" + outer objects: ${outerObjects.size}")
            outerObjects.foreach { o => logDebug(s"     $o") }
          }
    ```
    
    and that calls `toString` again and thus the problem.
    
    While I don't think it's ideal to call distributed jobs in `toString`, they're only computed once lazily, so that's reasonable actually.
    
    While we can remove the `nullDeviance` etc, that's useful info. But I wonder whether there are similar problems lurking just like this. That said, this code has existed since about Spark 1.4 without issue, I guess.
    
    But then, the value of this debug statement goes away almost entirely, I think, in Scala 2.12, where closures aren't implemented with outer classes.
    
    I think I favor just removing this last debug statement in `ClosureCleaner`, or at best logging their classes only. This seems like less loss and behavior change than removing useful info from the `toString`.
    
    Any other thoughts out there -- anyone care a lot about the debug info from the closure cleaner about outer class(es)?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    @actuaryzhang @yanboliang 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Few variables (I think there are 3) printed in toString cause a Spark Job to be started and the main reason is that those variables are lazily evaluated. I can remove those variables from toString altogether.
    
    An alternative fix will be to change those lazily evaluated variables to instantaneously evaluated but that may cause some other issues that I am not aware of.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by ankuriitg <gi...@git.apache.org>.
Github user ankuriitg commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    If I remove the following code, then the test succeeds and no spark jobs are started. Shall I do this instead?
    
    `
          sb.append("\n")
          val nd = s"Null deviance: ${round(nullDeviance)} on $degreesOfFreedom degrees of freedom"
          val rd = s"Residual deviance: ${round(deviance)} on $residualDegreeOfFreedom degrees of " +
            "freedom"
          val l = math.max(nd.length, rd.length)
          sb.append(StringUtils.leftPad(nd, l))
          sb.append("\n")
          sb.append(StringUtils.leftPad(rd, l))
    
          if (family.name != "tweedie") {
            sb.append("\n")
            sb.append(s"AIC: " + round(aic))
          }
    `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    **[Test build #96830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96830/testReport)** for PR 22604 at commit [`8c162a7`](https://github.com/apache/spark/commit/8c162a74a06d74a8a60aade4bf6e8af5906a8fb7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22604: [SPARK-25586][MLlib][Core] Replace toString method with ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22604
  
    **[Test build #96830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96830/testReport)** for PR 22604 at commit [`8c162a7`](https://github.com/apache/spark/commit/8c162a74a06d74a8a60aade4bf6e8af5906a8fb7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org