Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2014/12/19 21:29:23 UTC

[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/3746

    [SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R

    In most academic papers and algorithm implementations, 
    people use L = 1/(2n) ||A weights - y||^2 instead of L = 1/n ||A weights - y||^2 
    for the least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf
    
    Since MLlib uses a different convention, this results in different residuals, and 
    all the statistical properties will differ from those of the GLMNET package in R. 
    
    The model coefficients will still be the same under this change.
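
To make the two conventions concrete, here is a minimal sketch in plain Python (not the MLlib code). It assumes a single-feature model y ≈ w * x with no intercept; the data and the helper names `loss_and_grad` and `minimizer` are made up for illustration. It shows that the loss and gradient values differ by a factor of 2 between the conventions, while the unregularized minimizer is the same.

```python
def loss_and_grad(xs, ys, w, scale):
    """Return (loss, gradient) for L(w) = scale * sum((w*x - y)^2)."""
    residuals = [w * x - y for x, y in zip(xs, ys)]
    loss = scale * sum(r * r for r in residuals)
    grad = scale * 2.0 * sum(r * x for r, x in zip(residuals, xs))
    return loss, grad


def minimizer(xs, ys):
    """Closed-form least-squares solution; a constant scale does not affect it."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]
n = len(xs)
w = 0.5

loss_full, grad_full = loss_and_grad(xs, ys, w, 1.0 / n)        # L = 1/n ||Aw - y||^2
loss_half, grad_half = loss_and_grad(xs, ys, w, 1.0 / (2 * n))  # L = 1/(2n) ||Aw - y||^2

w_star = minimizer(xs, ys)
# The gradient vanishes at the same w under either convention.
grad_full_at_min = loss_and_grad(xs, ys, w_star, 1.0 / n)[1]
grad_half_at_min = loss_and_grad(xs, ys, w_star, 1.0 / (2 * n))[1]

print(loss_full, loss_half)  # 3.375 1.6875
print(grad_full, grad_half)  # -9.0 -4.5
```

Any statistic derived from the absolute value of the loss inherits the factor of 2, which is the discrepancy the description refers to.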


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark lir

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3746
    
----
commit 0b2c29c2791306a257413e0434f346d2884a31a0
Author: DB Tsai <db...@alpinenow.com>
Date:   2014-12-19T20:27:39Z

    first commit

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67711151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24657/



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3746



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67701062
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24656/



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67693515
  
    Seems reasonable to me.



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67701050
  
      [Test build #24656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24656/consoleFull) for   PR 3746 at commit [`0b2c29c`](https://github.com/apache/spark/commit/0b2c29c2791306a257413e0434f346d2884a31a0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67693312
  
      [Test build #24656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24656/consoleFull) for   PR 3746 at commit [`0b2c29c`](https://github.com/apache/spark/commit/0b2c29c2791306a257413e0434f346d2884a31a0).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67711145
  
      [Test build #24657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24657/consoleFull) for   PR 3746 at commit [`19c2e85`](https://github.com/apache/spark/commit/19c2e85a6a1be705a3b048ac12577fec7ac63691).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by bryanyang0528 <gi...@git.apache.org>.

Github user bryanyang0528 commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67847818
  
    @dbtsai Thank you for your clear explanation, which helps me a lot!



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67826137
  
    @bryanyang0528 I don't think anyone's suggesting that the extra factor of 1/2 is more or less correct or desirable per se. The solution doesn't depend on the absolute value of the loss function, but its minimum only. But I think the question here is being consistent with the loss function as implemented by other software packages, so that the absolute value can be compared, for the same setting of learning rate, overfitting param, etc.
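
The point about comparing results "for the same setting" of the other parameters can be illustrated with a small sketch. It assumes, purely for illustration, a one-feature model with an L2 penalty of the form (lam/2) * w^2; the function name `ridge_1d`, the penalty form, and the data are hypothetical and not taken from MLlib or glmnet. Without a penalty the two conventions give the same coefficient, but with the same `lam` they do not, unless `lam` is rescaled by the same factor of 2.

```python
def ridge_1d(xs, ys, lam, half):
    """Minimizer of c * sum((w*x - y)^2) + (lam / 2) * w^2,
    where c = 1/(2n) if half is True, else 1/n."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    c = 1.0 / (2 * n) if half else 1.0 / n
    # Setting the derivative 2*c*(w*sxx - sxy) + lam*w to zero and solving for w:
    return 2 * c * sxy / (2 * c * sxx + lam)


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]

# No penalty: both conventions agree.
w_half_unreg = ridge_1d(xs, ys, lam=0.0, half=True)
w_full_unreg = ridge_1d(xs, ys, lam=0.0, half=False)

# Same penalty strength: the coefficients differ.
w_half = ridge_1d(xs, ys, lam=1.0, half=True)
w_full = ridge_1d(xs, ys, lam=1.0, half=False)

# Doubling lam under the 1/n convention recovers the 1/(2n) result.
w_full_rescaled = ridge_1d(xs, ys, lam=2.0, half=False)

print(w_half_unreg, w_full_unreg)
print(w_half, w_full, w_full_rescaled)
```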



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by bryanyang0528 <gi...@git.apache.org>.

Github user bryanyang0528 commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67825123
  
    In my opinion, whether the factor in the cost function is 1/n or 1/2n is not the critical difference.
    With the cost function L = alpha * 1/2n ||A weights-y||^2 (alpha is the learning rate), we can adjust the learning rate to get the same result no matter whether the factor is 1/n or 1/2n.
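
The claim about the learning rate can be checked with a small sketch of fixed-step gradient descent. This is an illustration only, not MLlib's optimizer; the function `gradient_descent`, the step sizes, and the data are hypothetical. Because the gradient under the 1/n convention is exactly twice the gradient under the 1/2n convention, doubling the learning rate for the 1/2n loss reproduces the 1/n iterates, while keeping the same learning rate does not.

```python
def gradient_descent(xs, ys, scale, alpha, steps, w0=0.0):
    """Fixed-step gradient descent on L(w) = scale * sum((w*x - y)^2)."""
    w = w0
    for _ in range(steps):
        grad = scale * 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= alpha * grad
    return w


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]
n = len(xs)

# 1/n convention with learning rate 0.05.
w_full = gradient_descent(xs, ys, 1.0 / n, alpha=0.05, steps=5)
# 1/2n convention with the same learning rate: a different iterate after 5 steps.
w_half_same_alpha = gradient_descent(xs, ys, 1.0 / (2 * n), alpha=0.05, steps=5)
# 1/2n convention with the learning rate doubled: matches the 1/n run.
w_half_doubled_alpha = gradient_descent(xs, ys, 1.0 / (2 * n), alpha=0.10, steps=5)

print(w_full, w_half_same_alpha, w_half_doubled_alpha)
```

This only addresses the optimization path; the reported loss value still differs by the factor of 2.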




[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67909869
  
    LGTM. Merged into master. Thanks!



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by bryanyang0528 <gi...@git.apache.org>.

Github user bryanyang0528 commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67836176
  
    @srowen I agree that we need an absolute value that can be compared with other software. Maybe we could add a parameter to control the extra factor?



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67702973
  
      [Test build #24657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24657/consoleFull) for   PR 3746 at commit [`19c2e85`](https://github.com/apache/spark/commit/19c2e85a6a1be705a3b048ac12577fec7ac63691).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/3746#issuecomment-67842962
  
    @bryanyang0528 The learning rate issue is a different story. With modern optimization algorithms like L-BFGS and OWLQN, a learning rate is not required; the algorithms find the step size in the line search step. As @srowen pointed out, without the 1/2 factor the statistical properties of the model will differ from those of other packages. At Alpine Data Labs, I implemented a generalized linear model with elastic net (mixing L1 and L2) using OWLQN in Spark, and I can train it and get exactly the same coefficients and the same statistical properties for the model, including std error, p-value, t-value, residual plot, QQ plot, etc. For many of our customers in the financial industry, those stats are very important, and it's required to get the same solution as R's well-known reference implementation, but with scalability.

    Although I only have limited time for contributing to open source projects, I'll try to have most of my work available in Spark 1.3.
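
To illustrate why a line search removes the need for a hand-tuned learning rate, here is a minimal sketch of steepest descent with a backtracking (Armijo) line search. It is a simplified stand-in, not the line search used by L-BFGS or OWLQN implementations, and the function names, constants, and data are assumptions made for illustration. On this toy problem the search picks a step size that adapts to the scaling of the loss, so both conventions reach the same coefficient with no learning rate supplied.

```python
def loss(xs, ys, w, scale):
    return scale * sum((w * x - y) ** 2 for x, y in zip(xs, ys))


def grad(xs, ys, w, scale):
    return scale * 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys))


def descend_with_line_search(xs, ys, scale, steps=50, w0=0.0, tol=1e-10):
    """Steepest descent where each step size is found by backtracking."""
    w = w0
    for _ in range(steps):
        g = grad(xs, ys, w, scale)
        if abs(g) < tol:
            break
        f = loss(xs, ys, w, scale)
        t = 1.0
        # Halve t until the Armijo sufficient-decrease condition holds
        # (constant 0.5); the lower bound on t guarantees termination.
        while t > 1e-12 and loss(xs, ys, w - t * g, scale) > f - 0.5 * t * g * g:
            t *= 0.5
        w -= t * g
    return w


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]
n = len(xs)

w_full = descend_with_line_search(xs, ys, 1.0 / n)        # 1/n convention
w_half = descend_with_line_search(xs, ys, 1.0 / (2 * n))  # 1/(2n) convention

print(w_full, w_half)
```

The coefficients agree, but the loss values reported along the way still differ by the factor of 2, which is why the convention matters for comparing statistics against other packages.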
