You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by iyounus <gi...@git.apache.org> on 2015/12/18 19:56:51 UTC

[GitHub] spark pull request: R^2 for regression through the origin.

GitHub user iyounus opened a pull request:

    https://github.com/apache/spark/pull/10384

    R^2 for regression through the origin.

    Modified the definition of R^2 for regression through origin. Added modified test for regression metrics.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyounus/spark SPARK_12331_R2_for_regression_through_origin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10384.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10384
    
----
commit d80ecac7f011b57a867606bdc818154ea871580d
Author: Imran Younus <iy...@us.ibm.com>
Date:   2015-12-18T18:53:29Z

    Modified the definition of R^2 for regression through origin. Added modified test for regression metrics.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-165870751
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168778083
  
    **[Test build #48680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48680/consoleFull)** for PR 10384 at commit [`c563460`](https://github.com/apache/spark/commit/c5634607bd044ddf06bd4fe26fa4e4fe4885656a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166725661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48210/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166470324
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48144/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-167774912
  
    @iyounus are you still working on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168523352
  
    LGTM except minor issues pointed out by @srowen Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166725659
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168015796
  
    **[Test build #48487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48487/consoleFull)** for PR 10384 at commit [`ffad56f`](https://github.com/apache/spark/commit/ffad56fe3f4c1d3d083116b32e7ccddc6ae90110).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48239041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,15 +23,23 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
    --- End diff --
    
    Please add back the empty line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48064092
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    I will double check this. I've seen this being called biased. The problem is that this special case is rarely discussed in literature so the documentation is scarce.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166479111
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166479115
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48146/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-165900186
  
    @iyounus please open JIRA and attach it to the title of this patch. See how other patches are opened.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48061706
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    We'll also have to explain what `isUnbiased` means. Is this really the canonical term? something about regression through the origin, or not fitting an intercept, would be more consistent with terminology in Spark?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48390247
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,15 +23,23 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param regThroughOrigin true if intercept is not included in linear regression model
    --- End diff --
    
    That makes sense. Could you change `regThroughOrigin` to `throughOrigin` since reg is often used as regularization. Also, could you update the scala doc saying that `True if the regression is through the origin. For example, in linear regression, it will be true without fitting intercept`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48390399
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -61,7 +67,6 @@ class RegressionMetrics @Since("1.2.0") (
           case (prediction, _) => math.pow(prediction - yMean, 2)
         }.sum()
       }
    -
    --- End diff --
    
    Please add back this empty line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168979726
  
    Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168022225
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48487/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166725501
  
    **[Test build #48210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48210/consoleFull)** for PR 10384 at commit [`3744eb0`](https://github.com/apache/spark/commit/3744eb05f6941ea56fc94b3fea7320c0831c7ba0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168791849
  
    **[Test build #48680 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48680/consoleFull)** for PR 10384 at commit [`c563460`](https://github.com/apache/spark/commit/c5634607bd044ddf06bd4fe26fa4e4fe4885656a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RegressionMetrics @Since(\"2.0.0\") (`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168022137
  
    **[Test build #48487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48487/consoleFull)** for PR 10384 at commit [`ffad56f`](https://github.com/apache/spark/commit/ffad56fe3f4c1d3d083116b32e7ccddc6ae90110).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48692216
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -27,11 +27,17 @@ import org.apache.spark.sql.DataFrame
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param throughOrigin True if the regression is through the origin. For example, in linear
    + *                      regression, it will be true without fitting intercept.
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], throughOrigin: Boolean)
    --- End diff --
    
    @iyounus ah we have two tiny problems to sort then this can be merged. First the `@Since` annotation is wrong here; it needs to move to the new constructor, and this needs a new "since 2.0.0" annotation on the main constructor. Also, in the scaladoc, `predictionAndObservations` ends with a comma now but should be left unchanged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48239336
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,15 +23,23 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param regThroughOrigin true if intercept is not included in linear regression model
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], regThroughOrigin: Boolean)
    +    extends Logging {
    +
    +  /**
    +   * @param predictionAndObservations an RDD with two double columns:
    +   *                                  prediction and observation
    --- End diff --
    
    Should we copy the original scala doc, ie, `an RDD of (prediction, observation) pairs.`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10384


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168792025
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168014813
  
    **[Test build #2267 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2267/consoleFull)** for PR 10384 at commit [`ffad56f`](https://github.com/apache/spark/commit/ffad56fe3f4c1d3d083116b32e7ccddc6ae90110).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168792027
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48680/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168014617
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48178585
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -105,6 +112,14 @@ class RegressionMetrics @Since("1.2.0") (
        */
       @Since("1.2.0")
       def r2: Double = {
    -    1 - SSerr / SStot
    +    // In case of regression through the origin (biased case), the definition of R^2 is
    +    // to be modified. Here is a review paper which explains why:
    +    // J. G. Eisenhauer, Regression through the Origin. Teaching Statistics 25, 76–80 (2003)
    +    // https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
    --- End diff --
    
    @iyounus Can you put this comment into the scaladoc instead? [This example](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L27) might be of use.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166400018
  
    **[Test build #48122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48122/consoleFull)** for PR 10384 at commit [`526d156`](https://github.com/apache/spark/commit/526d1562ed3bc9a0870ca542f6831868db5d8633).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48072262
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala ---
    @@ -22,91 +22,111 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.mllib.util.TestingUtils._
     
     class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  val obs = List[Double](77, 85, 62, 55, 63, 88, 57, 81, 51)
    +  val eps = 1E-5
     
       test("regression metrics for unbiased (includes intercept term) predictor") {
         /* Verify results in R:
    -       preds = c(2.25, -0.25, 1.75, 7.75)
    -       obs = c(3.0, -0.5, 2.0, 7.0)
    +        y = c(77, 85, 62, 55, 63, 88, 57, 81, 51)
    +        x = c(16, 22, 14, 10, 13, 19, 12, 18, 11)
    +        df <- as.data.frame(cbind(x, y))
    +        model <- lm(y ~  x, data=df)
    +        preds <- signif(predict(model), digits = 4)
     
    -       SStot = sum((obs - mean(obs))^2)
    -       SSreg = sum((preds - mean(obs))^2)
    -       SSerr = sum((obs - preds)^2)
    +        cat("predictions: ", preds, "\n")
    +        cat("explainedVariance =", mean((preds - mean(y))^2), "\n")
    +        cat("meanAbsoluteError =", mean(abs(preds - y)), "\n")
    +        cat("meanSquaredError  =", mean((preds - y)^2), "\n")
    +        cat("rmse =", sqrt(mean((preds - y)^2)), "\n")
    +        cat("r2 =", summary(model)$r.squared, "\n")
     
    -       explainedVariance = SSreg / length(obs)
    -       explainedVariance
    -       > [1] 8.796875
    -       meanAbsoluteError = mean(abs(preds - obs))
    -       meanAbsoluteError
    -       > [1] 0.5
    -       meanSquaredError = mean((preds - obs)^2)
    -       meanSquaredError
    -       > [1] 0.3125
    -       rmse = sqrt(meanSquaredError)
    -       rmse
    -       > [1] 0.559017
    -       r2 = 1 - SSerr / SStot
    -       r2
    -       > [1] 0.9571734
    +      Output of R code:
    +        predictions:  72.08 91.88 65.48 52.28 62.18 81.98 58.88 78.68 55.58
    +        explainedVariance = 157.3
    +        meanAbsoluteError = 3.735556
    +        meanSquaredError  = 17.53951
    +        rmse = 4.18802
    +        r2 = 0.8996822
          */
    -    val predictionAndObservations = sc.parallelize(
    -      Seq((2.25, 3.0), (-0.25, -0.5), (1.75, 2.0), (7.75, 7.0)), 2)
    +    val preds = List(72.08, 91.88, 65.48, 52.28, 62.18, 81.98, 58.88, 78.68, 55.58)
    +    val pairs: Seq[(Double, Double)] = preds.zip(obs)
    +    val predictionAndObservations = sc.parallelize(pairs, 2)
         val metrics = new RegressionMetrics(predictionAndObservations)
    -    assert(metrics.explainedVariance ~== 8.79687 absTol 1E-5,
    +    assert(metrics.explainedVariance ~== 157.3 absTol eps,
           "explained variance regression score mismatch")
    -    assert(metrics.meanAbsoluteError ~== 0.5 absTol 1E-5, "mean absolute error mismatch")
    -    assert(metrics.meanSquaredError ~== 0.3125 absTol 1E-5, "mean squared error mismatch")
    -    assert(metrics.rootMeanSquaredError ~== 0.55901 absTol 1E-5,
    +    assert(metrics.meanAbsoluteError ~== 3.735556 absTol eps, "mean absolute error mismatch")
    +    assert(metrics.meanSquaredError ~== 17.53951 absTol eps, "mean squared error mismatch")
    +    assert(metrics.rootMeanSquaredError ~== 4.18802 absTol eps,
           "root mean squared error mismatch")
    -    assert(metrics.r2 ~== 0.95717 absTol 1E-5, "r2 score mismatch")
    +    assert(metrics.r2 ~== 0.8996822 absTol eps, "r2 score mismatch")
       }
     
       test("regression metrics for biased (no intercept term) predictor") {
         /* Verify results in R:
    -       preds = c(2.5, 0.0, 2.0, 8.0)
    -       obs = c(3.0, -0.5, 2.0, 7.0)
    +        y = c(77, 85, 62, 55, 63, 88, 57, 81, 51)
    +        x = c(16, 22, 14, 10, 13, 19, 12, 18, 11)
    +        df <- as.data.frame(cbind(x, y))
    +        model <- lm(y ~ 0 + x, data=df)
    +        preds <- signif(predict(model), digits = 4)
     
    -       SStot = sum((obs - mean(obs))^2)
    -       SSreg = sum((preds - mean(obs))^2)
    -       SSerr = sum((obs - preds)^2)
    +        cat("predictions: ", preds, "\n")
    +        cat("explainedVariance =", mean((preds - mean(y))^2), "\n")
    +        cat("meanAbsoluteError =", mean(abs(preds - y)), "\n")
    +        cat("meanSquaredError  =", mean((preds - y)^2), "\n")
    +        cat("rmse =", sqrt(mean((preds - y)^2)), "\n")
    +        cat("r2 =", summary(model)$r.squared, "\n")
     
    -       explainedVariance = SSreg / length(obs)
    -       explainedVariance
    -       > [1] 8.859375
    -       meanAbsoluteError = mean(abs(preds - obs))
    -       meanAbsoluteError
    -       > [1] 0.5
    -       meanSquaredError = mean((preds - obs)^2)
    -       meanSquaredError
    -       > [1] 0.375
    -       rmse = sqrt(meanSquaredError)
    -       rmse
    -       > [1] 0.6123724
    -       r2 = 1 - SSerr / SStot
    -       r2
    -       > [1] 0.9486081
    +      Output of R code:
    --- End diff --
    
    I can follow that conversion. I just thought it was easier to save the code in a file and run it from R prompt.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-167017269
  
    LGTM except minor issues. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166400137
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48122/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48209708
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -105,6 +112,14 @@ class RegressionMetrics @Since("1.2.0") (
        */
       @Since("1.2.0")
       def r2: Double = {
    -    1 - SSerr / SStot
    +    // In case of regression through the origin (biased case), the definition of R^2 is
    +    // to be modified. Here is a review paper which explains why:
    +    // J. G. Eisenhauer, Regression through the Origin. Teaching Statistics 25, 76–80 (2003)
    +    // https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
    --- End diff --
    
    OK, I think I didn't understand your comment before. I'll move this note to scaladoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48164151
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    +1 for something along lines of regression through the origin.
    
    I don't like `fitIntercept` since some regression models (like tree models) don't have a concept of intercept. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48058263
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -61,7 +64,9 @@ class RegressionMetrics @Since("1.2.0") (
           case (prediction, _) => math.pow(prediction - yMean, 2)
         }.sum()
       }
    -
    +  private lazy val SSy = predictionAndObservations.map {
    --- End diff --
    
    This should already be computed in the statistical summarizer. This can be declared like: 
    
    ```scala
    private lazy val SSy = math.pow(summary.normL2(0), 2)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48355484
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,15 +23,23 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param regThroughOrigin true if intercept is not included in linear regression model
    --- End diff --
    
    @dbtsai It may be nitpicking, but `RegressionMetrics` class can be used for any regression model and so `hasFitIntercept` doesn't make sense for all types of regressors (as mentioned in other comments). Someone evaluating a decision tree regression model might be confused by the parameter. I am not certain what is best, so I will defer to others' opinions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48200030
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,21 +23,28 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param isUnbiased true if intercept is included in linear regression model
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean)
    +    extends Logging {
    +  /**
    +    * @param predictionAndObservations an RDD with two double columns:
    +    *                                  prediction and observation
    +    */
    +  def this(predictionAndObservations: RDD[(Double, Double)]) =
    +    this(predictionAndObservations, true)
     
       /**
    -   * An auxiliary constructor taking a DataFrame.
    -   * @param predictionAndObservations a DataFrame with two double columns:
    -   *                                  prediction and observation
    -   */
    +    * An auxiliary constructor taking a DataFrame.
    --- End diff --
    
    PS I think these space changes should be undone


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166469828
  
    **[Test build #48144 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48144/consoleFull)** for PR 10384 at commit [`24fdf76`](https://github.com/apache/spark/commit/24fdf76e91b576b7648d33a53761919c2d3d83d9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48200009
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -105,6 +112,14 @@ class RegressionMetrics @Since("1.2.0") (
        */
       @Since("1.2.0")
       def r2: Double = {
    -    1 - SSerr / SStot
    +    // In case of regression through the origin (biased case), the definition of R^2 is
    +    // to be modified. Here is a review paper which explains why:
    +    // J. G. Eisenhauer, Regression through the Origin. Teaching Statistics 25, 76–80 (2003)
    +    // https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
    --- End diff --
    
    I can put it in scala docs as well. I think it would nice to have it here also, so that other developers can see why there are two definitions here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166715954
  
    **[Test build #48210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48210/consoleFull)** for PR 10384 at commit [`3744eb0`](https://github.com/apache/spark/commit/3744eb05f6941ea56fc94b3fea7320c0831c7ba0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168020269
  
    **[Test build #2267 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2267/consoleFull)** for PR 10384 at commit [`ffad56f`](https://github.com/apache/spark/commit/ffad56fe3f4c1d3d083116b32e7ccddc6ae90110).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48390400
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -53,6 +58,7 @@ class RegressionMetrics @Since("1.2.0") (
           )
         summary
       }
    --- End diff --
    
    Add an empty line here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48058626
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -105,6 +110,12 @@ class RegressionMetrics @Since("1.2.0") (
        */
       @Since("1.2.0")
       def r2: Double = {
    -    1 - SSerr / SStot
    +    // In case of regression through the origin (biased case), the definition of R^2 is
    +    // to be modified. See jira SPARK-12331
    --- End diff --
    
    Instead of referencing the JIRA in the comment, it is probably better to explicitly define how the value is computed in the scaladoc. A reference to an appropriate paper might be good as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48239605
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,15 +23,23 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param regThroughOrigin true if intercept is not included in linear regression model
    --- End diff --
    
    Should we just call it `hasIntercept` or `fitIntercept` for consistency with other code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166388345
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168013749
  
    Just made the changed as suggested by @dbtsai. Sorry for the delay.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166470322
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166470316
  
    **[Test build #48144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48144/consoleFull)** for PR 10384 at commit [`24fdf76`](https://github.com/apache/spark/commit/24fdf76e91b576b7648d33a53761919c2d3d83d9).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48287387
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,15 +23,23 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param regThroughOrigin true if intercept is not included in linear regression model
    --- End diff --
    
    Both have pros and cons. I'm not sure which one is better from user's perspective (who is not reading the code). Personally, I would prefer `hasFitIntercept` to be consistent with the code. I can make this change if we can reach consensus.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48061415
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -534,7 +534,7 @@ class LinearRegressionSummary private[regression] (
       @transient private val metrics = new RegressionMetrics(
         predictions
           .select(predictionCol, labelCol)
    -      .map { case Row(pred: Double, label: Double) => (pred, label) } )
    +      .map { case Row(pred: Double, label: Double) => (pred, label) }, model.getFitIntercept)
    --- End diff --
    
    Nit: I'd start the second arg on the next line, indented like the first arg `predictions...` to make it clear it's at the same level.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48199690
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    I've skimmed through some literature, and see that some authors call the constant in regression as "bias term" but its not very common. So `isUnbiased` may not be right choice for this variable. As @sethah pointed out, `fitIntercept` is not applicable in other regression models. How about `regThroughOrigin`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by iyounus <gi...@git.apache.org>.
Github user iyounus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48288460
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala ---
    @@ -22,91 +22,115 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.mllib.util.TestingUtils._
     
     class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  val obs = List[Double](77, 85, 62, 55, 63, 88, 57, 81, 51)
    +  val eps = 1E-5
     
       test("regression metrics for unbiased (includes intercept term) predictor") {
    --- End diff --
    
    The old test passed after the changes I made except for the r^2 (as expected). The reason I changed the test completely was that I couldn't get the value of r^2 from R without fitting the model. I didn't want to copy/past my formula in R and get the results from there. I really wanted to get the rest of parameters also from the fit in R, but that can only be dome by using anova table which is cumbersome.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48061537
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    I think we may have to make this non-optional, but then keep the overloaded constructor, to retain binary compatibility.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48200114
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    That seems close enough to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166478969
  
    **[Test build #48146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48146/consoleFull)** for PR 10384 at commit [`d4b819e`](https://github.com/apache/spark/commit/d4b819e471133266def3efba7bc480e06d5e6fc4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48239945
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala ---
    @@ -22,91 +22,115 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.mllib.util.TestingUtils._
     
     class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  val obs = List[Double](77, 85, 62, 55, 63, 88, 57, 81, 51)
    +  val eps = 1E-5
     
       test("regression metrics for unbiased (includes intercept term) predictor") {
    --- End diff --
    
    I thought when the intercept is included, the stats should be correct. Why do you change this test case? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48200071
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -23,21 +23,28 @@ import org.apache.spark.Logging
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
     import org.apache.spark.sql.DataFrame
    -
     /**
      * Evaluator for regression.
      *
    - * @param predictionAndObservations an RDD of (prediction, observation) pairs.
    + * @param predictionAndObservations an RDD of (prediction, observation) pairs,
    + * @param isUnbiased true if intercept is included in linear regression model
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean)
    +    extends Logging {
    +  /**
    --- End diff --
    
    Need a space before this, and the scaladoc body lines are indented too much


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48061216
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    --- End diff --
    
    The default parameter here is redundant due to the overloaded constructor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166389342
  
    **[Test build #48122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48122/consoleFull)** for PR 10384 at commit [`526d156`](https://github.com/apache/spark/commit/526d1562ed3bc9a0870ca542f6831868db5d8633).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48061563
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
    @@ -31,13 +30,17 @@ import org.apache.spark.sql.DataFrame
      */
     @Since("1.2.0")
     class RegressionMetrics @Since("1.2.0") (
    -    predictionAndObservations: RDD[(Double, Double)]) extends Logging {
    +    predictionAndObservations: RDD[(Double, Double)], isUnbiased: Boolean = true)
    +    extends Logging {
     
       /**
        * An auxiliary constructor taking a DataFrame.
        * @param predictionAndObservations a DataFrame with two double columns:
        *                                  prediction and observation
        */
    +  def this(predictionAndObservations: RDD[(Double, Double)]) =
    +    this(predictionAndObservations: RDD[(Double, Double)], true)
    --- End diff --
    
    Why does `predictionAndObservations` need a type in this invocation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-168022223
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: R^2 for regression through the origin.

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48062304
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala ---
    @@ -22,91 +22,111 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.mllib.util.TestingUtils._
     
     class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  val obs = List[Double](77, 85, 62, 55, 63, 88, 57, 81, 51)
    +  val eps = 1E-5
     
       test("regression metrics for unbiased (includes intercept term) predictor") {
         /* Verify results in R:
    -       preds = c(2.25, -0.25, 1.75, 7.75)
    -       obs = c(3.0, -0.5, 2.0, 7.0)
    +        y = c(77, 85, 62, 55, 63, 88, 57, 81, 51)
    +        x = c(16, 22, 14, 10, 13, 19, 12, 18, 11)
    +        df <- as.data.frame(cbind(x, y))
    +        model <- lm(y ~  x, data=df)
    +        preds <- signif(predict(model), digits = 4)
     
    -       SStot = sum((obs - mean(obs))^2)
    -       SSreg = sum((preds - mean(obs))^2)
    -       SSerr = sum((obs - preds)^2)
    +        cat("predictions: ", preds, "\n")
    +        cat("explainedVariance =", mean((preds - mean(y))^2), "\n")
    +        cat("meanAbsoluteError =", mean(abs(preds - y)), "\n")
    +        cat("meanSquaredError  =", mean((preds - y)^2), "\n")
    +        cat("rmse =", sqrt(mean((preds - y)^2)), "\n")
    +        cat("r2 =", summary(model)$r.squared, "\n")
     
    -       explainedVariance = SSreg / length(obs)
    -       explainedVariance
    -       > [1] 8.796875
    -       meanAbsoluteError = mean(abs(preds - obs))
    -       meanAbsoluteError
    -       > [1] 0.5
    -       meanSquaredError = mean((preds - obs)^2)
    -       meanSquaredError
    -       > [1] 0.3125
    -       rmse = sqrt(meanSquaredError)
    -       rmse
    -       > [1] 0.559017
    -       r2 = 1 - SSerr / SStot
    -       r2
    -       > [1] 0.9571734
    +      Output of R code:
    +        predictions:  72.08 91.88 65.48 52.28 62.18 81.98 58.88 78.68 55.58
    +        explainedVariance = 157.3
    +        meanAbsoluteError = 3.735556
    +        meanSquaredError  = 17.53951
    +        rmse = 4.18802
    +        r2 = 0.8996822
          */
    -    val predictionAndObservations = sc.parallelize(
    -      Seq((2.25, 3.0), (-0.25, -0.5), (1.75, 2.0), (7.75, 7.0)), 2)
    +    val preds = List(72.08, 91.88, 65.48, 52.28, 62.18, 81.98, 58.88, 78.68, 55.58)
    +    val pairs: Seq[(Double, Double)] = preds.zip(obs)
    +    val predictionAndObservations = sc.parallelize(pairs, 2)
         val metrics = new RegressionMetrics(predictionAndObservations)
    -    assert(metrics.explainedVariance ~== 8.79687 absTol 1E-5,
    +    assert(metrics.explainedVariance ~== 157.3 absTol eps,
           "explained variance regression score mismatch")
    -    assert(metrics.meanAbsoluteError ~== 0.5 absTol 1E-5, "mean absolute error mismatch")
    -    assert(metrics.meanSquaredError ~== 0.3125 absTol 1E-5, "mean squared error mismatch")
    -    assert(metrics.rootMeanSquaredError ~== 0.55901 absTol 1E-5,
    +    assert(metrics.meanAbsoluteError ~== 3.735556 absTol eps, "mean absolute error mismatch")
    +    assert(metrics.meanSquaredError ~== 17.53951 absTol eps, "mean squared error mismatch")
    +    assert(metrics.rootMeanSquaredError ~== 4.18802 absTol eps,
           "root mean squared error mismatch")
    -    assert(metrics.r2 ~== 0.95717 absTol 1E-5, "r2 score mismatch")
    +    assert(metrics.r2 ~== 0.8996822 absTol eps, "r2 score mismatch")
       }
     
       test("regression metrics for biased (no intercept term) predictor") {
         /* Verify results in R:
    -       preds = c(2.5, 0.0, 2.0, 8.0)
    -       obs = c(3.0, -0.5, 2.0, 7.0)
    +        y = c(77, 85, 62, 55, 63, 88, 57, 81, 51)
    +        x = c(16, 22, 14, 10, 13, 19, 12, 18, 11)
    +        df <- as.data.frame(cbind(x, y))
    +        model <- lm(y ~ 0 + x, data=df)
    +        preds <- signif(predict(model), digits = 4)
     
    -       SStot = sum((obs - mean(obs))^2)
    -       SSreg = sum((preds - mean(obs))^2)
    -       SSerr = sum((obs - preds)^2)
    +        cat("predictions: ", preds, "\n")
    +        cat("explainedVariance =", mean((preds - mean(y))^2), "\n")
    +        cat("meanAbsoluteError =", mean(abs(preds - y)), "\n")
    +        cat("meanSquaredError  =", mean((preds - y)^2), "\n")
    +        cat("rmse =", sqrt(mean((preds - y)^2)), "\n")
    +        cat("r2 =", summary(model)$r.squared, "\n")
     
    -       explainedVariance = SSreg / length(obs)
    -       explainedVariance
    -       > [1] 8.859375
    -       meanAbsoluteError = mean(abs(preds - obs))
    -       meanAbsoluteError
    -       > [1] 0.5
    -       meanSquaredError = mean((preds - obs)^2)
    -       meanSquaredError
    -       > [1] 0.375
    -       rmse = sqrt(meanSquaredError)
    -       rmse
    -       > [1] 0.6123724
    -       r2 = 1 - SSerr / SStot
    -       r2
    -       > [1] 0.9486081
    +      Output of R code:
    --- End diff --
    
    We should stick with the convention of declaring R code as if it were computed in an R shell. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10384#discussion_r48252529
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala ---
    @@ -22,91 +22,115 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.mllib.util.TestingUtils._
     
     class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  val obs = List[Double](77, 85, 62, 55, 63, 88, 57, 81, 51)
    +  val eps = 1E-5
     
       test("regression metrics for unbiased (includes intercept term) predictor") {
    --- End diff --
    
    I think @iyounus made up a new test that is then used for both cases. Yeah it'd be kind of good to see that the old test still passes, but this is pretty reliably benchmarked vs R now and is a slightly better test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166558267
  
    That LGTM; let me wait for comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166400134
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12331][ML] R^2 for regression through t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10384#issuecomment-166472716
  
    **[Test build #48146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48146/consoleFull)** for PR 10384 at commit [`d4b819e`](https://github.com/apache/spark/commit/d4b819e471133266def3efba7bc480e06d5e6fc4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org