You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by tengpeng <gi...@git.apache.org> on 2017/11/02 05:15:16 UTC

[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

GitHub user tengpeng opened a pull request:

    https://github.com/apache/spark/pull/19638

    [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

    ## What changes were proposed in this pull request?
    
    I added adjusted R2 as a regression metric which was implemented in all major statistical analysis tools.
    
    In practice, no one looks at R2 alone. The reason is R2 itself is misleading. If we add more parameters, R2 will not decrease but only increase (or stay the same). This leads to overfitting. Adjusted R2 addressed this issue by using number of parameters as "weight" for the sum of errors.
    
    
    ## How was this patch tested?
    
    - Added a new unit test and passed.
    - ./dev/run-tests all passed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tengpeng/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19638
    
----
commit adee7b418f9e9feb70ec9abfaba9ab34c789523b
Author: test <jo...@quetica.com>
Date:   2017-11-02T05:01:55Z

    Implement Adjusted R2 with a new unit test

commit 692fcb3dd332c677d9dd4f75ebb3ed14db495d7c
Author: test <jo...@quetica.com>
Date:   2017-11-02T05:03:12Z

    Merge branch 'master' of git://git.apache.org/spark

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Do we need to expose it publicly in the trait? Could we make it `private[mllib]`? Since it will only be used to compute this metric.
    
    Alternatively, you can just compute `n` from the size of one of the stats vectors.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Hm, right, this fails because it's adding a method to a public trait. Although it's unlikely anyone implements this, they might. Hm, not sure what we do here that can preserve source and binary compatibility. Otherwise it'd have to wait for 3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19638


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148619618
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala ---
    @@ -230,6 +230,13 @@ class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with S
       override def count: Long = totalCnt
     
       /**
    +   * Number of parameters
    +   *
    +   */
    +  @Since("2.3.0")
    +  override def numParam: Int = n
    --- End diff --
    
    `numParams`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r149731177
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    Is there something wrong with the code I pasted above? That worked for me when I was using the R shell.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r150393944
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    1. Yes. 2. They(0.9999 and  0.9998736) are actually the same if on the same precision. 3. I only used import data. No other code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r150394001
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    The confusion lies in my incorrect assumption that R Studio's naming convention when importing data is the same as R shell. R Studio uses `X_` for all variables including the dependent variable.
    I have followed @sethah 's suggestion, replacing with `lm_fit <- lm(V1 ~ V2 + V3, data = d1)`. We should be fine here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    **[Test build #3987 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3987/testReport)** for PR 19638 at commit [`68e467e`](https://github.com/apache/spark/commit/68e467e1e78f2deb9aaa6fe2d6cfd907736d5668).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148461389
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala ---
    @@ -230,6 +230,12 @@ class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with S
       override def count: Long = totalCnt
     
       /**
    +   *
    --- End diff --
    
    Remove the empty scaladoc and fix since


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r150380973
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    Is the suggestion to use the code in https://github.com/apache/spark/pull/19638/files#r148852081 as the snippet here? that seems to show a different adjusted R2 than what the test asserts though. @tengpeng what's some R code that should execute given the code before this snippet that produces the R2 / adjusted R2 values below? 



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148626297
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
    @@ -49,8 +49,8 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
        */
       @Since("1.4.0")
       val metricName: Param[String] = {
    -    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "mae"))
    -    new Param(this, "metricName", "metric name in evaluation (mse|rmse|r2|mae)", allowedParams)
    +    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "r2adj", "mae"))
    +    new Param(this, "metricName", "metric name available (mse|rmse|r2|r2adj|mae)", allowedParams)
    --- End diff --
    
    This is simply because the length of line is > 100 if I keep "in evaluation". I am ok to keep it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148852449
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    +          summary(lm(X1 ~ ., data = part_00000))
    --- End diff --
    
    this still doesn't test the without intercept case. I think we actually need a test just to test the summary, where we can test with/without intercept r2 and other metrics against R code. Would you mind making another standalone test that compares the linear regression summary to R (both with/without intercept)?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148460065
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -125,4 +125,14 @@ class RegressionMetrics @Since("2.0.0") (
           1 - SSerr / SStot
         }
       }
    +
    +  /**
    +   * Returns adjusted R^2^, the adjusted coefficient of determination.
    +   * @see <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2">
    +   * Coefficient of determination (Wikipedia)</a>
    +   */
    +  @Since("2.2.0")
    --- End diff --
    
    This must be since 2.3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148618367
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
    @@ -49,8 +49,8 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
        */
       @Since("1.4.0")
       val metricName: Param[String] = {
    -    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "mae"))
    -    new Param(this, "metricName", "metric name in evaluation (mse|rmse|r2|mae)", allowedParams)
    +    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "r2adj", "mae"))
    +    new Param(this, "metricName", "metric name available (mse|rmse|r2|r2adj|mae)", allowedParams)
    --- End diff --
    
    How about "valid metrics" instead?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148619734
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala ---
    @@ -73,6 +73,11 @@ class RegressionEvaluatorSuite
         evaluator.setMetricName("r2")
         assert(evaluator.evaluate(predictions) ~== 0.9998387 absTol 0.01)
     
    +    // Adjusted r2
    --- End diff --
    
    can you add the R code to the comment above this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148692614
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -125,4 +125,14 @@ class RegressionMetrics @Since("2.0.0") (
           1 - SSerr / SStot
         }
       }
    +
    +  /**
    +   * Returns adjusted R^2^, the adjusted coefficient of determination.
    +   * @see <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2">
    +   * Coefficient of determination (Wikipedia)</a>
    +   */
    +  @Since("2.3.0")
    +  def r2adj: Double = {
    +    1 - (SSerr / (summary.count - summary.numParam - 1)) / (SStot / (summary.count - 1))
    --- End diff --
    
    1.  Will handle the no intercept case. 
    2. You are absolutely correct there is an issue in design.
    
    > Ok, but then you can't use it when doing cross validation right?
    - For linear regression, there is nothing to cross validate. For penalized methods, like Lasso, `R^2` is not used for cross validation at any time. 
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r149560345
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    Thanks for the clarification. Do you think change `x1` to `V1` would help?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    **[Test build #3986 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3986/testReport)** for PR 19638 at commit [`68e467e`](https://github.com/apache/spark/commit/68e467e1e78f2deb9aaa6fe2d6cfd907736d5668).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148869278
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    I just want to confirm: is `summary(lm(X1 ~ ., data = part_00000))` this does not work on your side? Is it because you used different variable name or data name?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148619202
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala ---
    @@ -230,6 +230,13 @@ class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with S
       override def count: Long = totalCnt
     
       /**
    +   * Number of parameters
    +   *
    --- End diff --
    
    don't need this line


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148664242
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -125,4 +125,14 @@ class RegressionMetrics @Since("2.0.0") (
           1 - SSerr / SStot
         }
       }
    +
    +  /**
    +   * Returns adjusted R^2^, the adjusted coefficient of determination.
    +   * @see <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2">
    +   * Coefficient of determination (Wikipedia)</a>
    +   */
    +  @Since("2.3.0")
    +  def r2adj: Double = {
    +    1 - (SSerr / (summary.count - summary.numParam - 1)) / (SStot / (summary.count - 1))
    --- End diff --
    
    This isn't correct for the case when there is no intercept. This [previous PR](https://github.com/apache/spark/pull/10384/) is relevant. Actually, there's a bigger problem: `RegressionMetrics` is only passed predictions and observations, nothing about the regression model that was used to fit it. Adjusted r2 doesn't make sense here. In fact, r2 shouldn't be here either since it's only valid for linear regression models. 
    
    The solution I propose: add a `val r2adj` in the linear regression summary, but simply define it in terms of the r2 value and don't add it to regression metrics or regression evaluator. 
    
    ```scala
    val r2adj: Double = {
        val interceptDOF = if (privateModel.getFitIntercept) 1 else 0
        1 - (1 - r2) * (numInstances - interceptDOF) / (numInstances - privateModel.coefficients.size - interceptDOF)
      }
    ```
    
    Ok, but then you can't use it when doing cross validation right? I'm not sure if there's a solution there - maybe to make a `LinearRegressionEvaluator`? `r2` and `adjr2` are not valid for non-linear regression http://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    **[Test build #3987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3987/testReport)** for PR 19638 at commit [`68e467e`](https://github.com/apache/spark/commit/68e467e1e78f2deb9aaa6fe2d6cfd907736d5668).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148641672
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala ---
    @@ -73,6 +73,11 @@ class RegressionEvaluatorSuite
         evaluator.setMetricName("r2")
         assert(evaluator.evaluate(predictions) ~== 0.9998387 absTol 0.01)
     
    +    // Adjusted r2
    +    evaluator.setMetricName("r2adj")
    +    assert(evaluator.evaluate(predictions) ~== 0.9998 absTol 0.01)
    +    print(evaluator.evaluate(predictions))
    --- End diff --
    
    > Alternatively, you can just compute n from the size of one of the stats vectors.
    
    I tried this. In old `RegressionMetrics`, only `y` related information is passed, observed `y_i` & fitted `\hat{y}`. The length of `x` is sent from `MultivariateStatisticalSummary` but the width does not. 
    
    Maybe I am missing something here. Do you have any hints to get the width of `x` using `states vectors`?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r149299091
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    @tengpeng @sethah did you resolve this discussion?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    **[Test build #3986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3986/testReport)** for PR 19638 at commit [`68e467e`](https://github.com/apache/spark/commit/68e467e1e78f2deb9aaa6fe2d6cfd907736d5668).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Not sure what's happening here. The test on my local machine passed:
    
    ========================================================================
    Running Apache RAT checks
    ========================================================================
    mkdir: target: File exists
    RAT checks passed.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148461468
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.scala ---
    @@ -44,6 +44,8 @@ trait MultivariateStatisticalSummary {
       @Since("1.0.0")
       def count: Long
     
    +  def numParam: Int
    --- End diff --
    
    Needs scaladoc and since


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148919520
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    +          summary(lm(X1 ~ ., data = part_00000))
    --- End diff --
    
    The existing test "Linear regression model training summary" does not test without intercept case. If there is a "Linear regression model training summary wo intercept", I am glad to add in it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    I have used @sethah 's approach to address the issues we have. Since we are not adding a new method to the public trait, there is no more binary compatibility issue.
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148634321
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
    @@ -49,8 +49,8 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
        */
       @Since("1.4.0")
       val metricName: Param[String] = {
    --- End diff --
    
    Change the comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Ignore it, it's due to an unrelated external problem with the repo1.maven.org domain, I'm pretty sure. I'm about to push a fix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148662710
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala ---
    @@ -230,6 +230,13 @@ class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with S
       override def count: Long = totalCnt
     
       /**
    +   * Number of parameters
    +   *
    +   */
    +  @Since("2.3.0")
    +  override def numParam: Int = n
    --- End diff --
    
    Hmmm... no this isn't right. When this is used with regression metrics, the summary is of dimension 2 always since the `predictionsAndObservations` is just an rdd of (prediction, label) pairs. The unit tests weren't thorough enough to catch it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    **[Test build #3977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3977/testReport)** for PR 19638 at commit [`180b2d6`](https://github.com/apache/spark/commit/180b2d6931620bb81365a4958ed0b232bdc4ccf1).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    @srowen I have fixed scaladocs and since issues. I will pay special attention to this issue next time.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148634479
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -722,6 +722,17 @@ class LinearRegressionSummary private[regression] (
       @Since("1.5.0")
       val r2: Double = metrics.r2
     
    +  /**
    +   * Returns Adjusted R^2^, the adjusted coefficient of determination.
    +   * Reference: <a href="http://en.wikipedia.org/wiki/Coefficient_of_determination">
    --- End diff --
    
    maybe Coefficient_of_determination#Adjusted_R2?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r149559666
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    There may be some confusion. If you type that code, "as-is", into an R shell, it will not work. It reference a variable called `X1`, which is never defined. When we provide R code in comments like this, we intend for it to be copy and pasted into a shell and just work. So, it does not function.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148852081
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    This code doesn't work. 
    
    ```r
    lm_fit <- lm(V1 ~ V2 + V3, data = d1)
    summary(lm_fit)$r.squared
    [1] 0.9998737
    summary(lm_fit)$adj.r.squared
    [1] 0.9998736
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148612789
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
    @@ -49,8 +49,8 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
        */
       @Since("1.4.0")
       val metricName: Param[String] = {
    -    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "mae"))
    -    new Param(this, "metricName", "metric name in evaluation (mse|rmse|r2|mae)", allowedParams)
    +    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "r2adj", "mae"))
    +    new Param(this, "metricName", "metric name available (mse|rmse|r2|r2adj|mae)", allowedParams)
    --- End diff --
    
    Not sure you want to change "in evaluation" here


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r149558607
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    @srowen it's fine in terms of functioning. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    **[Test build #3977 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3977/testReport)** for PR 19638 at commit [`180b2d6`](https://github.com/apache/spark/commit/180b2d6931620bb81365a4958ed0b232bdc4ccf1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics

Posted by tengpeng <gi...@git.apache.org>.
Github user tengpeng commented on the issue:

    https://github.com/apache/spark/pull/19638
  
    Would it be possible to add me to the white list for test? Thanks.
    
    On Thu, Nov 2, 2017 at 12:17 AM UCB AMPLab <no...@github.com> wrote:
    
    > Can one of the admins verify this patch?
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/19638#issuecomment-341320201>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/ACmTss9uceSDuvLGu1ddnyDcjd2WQahmks5syVCHgaJpZM4QPLaB>
    > .
    >
    -- 
    发自移动版 Gmail



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148623948
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala ---
    @@ -73,6 +73,11 @@ class RegressionEvaluatorSuite
         evaluator.setMetricName("r2")
         assert(evaluator.evaluate(predictions) ~== 0.9998387 absTol 0.01)
     
    +    // Adjusted r2
    +    evaluator.setMetricName("r2adj")
    +    assert(evaluator.evaluate(predictions) ~== 0.9998 absTol 0.01)
    +    print(evaluator.evaluate(predictions))
    --- End diff --
    
    remove


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148585371
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
    @@ -49,8 +49,8 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
        */
       @Since("1.4.0")
       val metricName: Param[String] = {
    -    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "mae"))
    -    new Param(this, "metricName", "metric name in evaluation (mse|rmse|r2|mae)", allowedParams)
    +    val allowedParams = ParamValidators.inArray(Array("mse", "rmse", "r2", "r2adj", "mae"))
    --- End diff --
    
    can you add a `val supportedMetrics = Array("mse", ...)` to the companion object and use that here? Look at `LogisticRegression.family` parameter for reference.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19638#discussion_r148881376
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -764,13 +764,17 @@ class LinearRegressionSuite
               (Intercept) 6.3022157  0.0018600    3388   <2e-16 ***
               V2          4.6982442  0.0011805    3980   <2e-16 ***
               V3          7.1994344  0.0009044    7961   <2e-16 ***
    +
    +          # R code for r2adj
    --- End diff --
    
    What is `X1`? it's never defined.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org