You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2016/02/18 07:42:00 UTC

[GitHub] spark pull request: Fix LogisticRegression when standardization = ...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/11247

    Fix LogisticRegression when standardization = false && regParam = 0.0

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    
    ## How was the this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-13372

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11247
    
----
commit 474c78e7b7a054a017e1858ffd3fe86a88c894e7
Author: Yanbo Liang <yb...@gmail.com>
Date:   2016-02-18T06:40:08Z

    Fix LogisticRegression when standardization = false && regParam = 0.0

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: Fix LogisticRegression when standardization = ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185569435
  
    **[Test build #51470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51470/consoleFull)** for PR 11247 at commit [`474c78e`](https://github.com/apache/spark/commit/474c78e7b7a054a017e1858ffd3fe86a88c894e7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185577621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51470/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/11247#issuecomment-188428112

@yanboliang I share the same concern with you. However, user may have `standardization = false`, but still want to have a good convergency when the scales are quite different. For example, you can test it out that scaling one column by 100x, the corresponding coefficient should be shrunk by 100x. This can not be achieved without this trick. In R's GLMNET, they do the same trick to ensure this property. Although it may be confusing for users, since it's transparent to users, I think it's still okay.

As for your second point, in `LogisticRegressionWithLBFGS`, when `standardization = false` and `regParam = 0.0`, the solution will be identical to `standardization = true` and `regParam = 0.0`, so the users still can get the correct answer. The breaking change in `LogisticRegressionWithLBFGS` is mainly solving the issue of regularizing the intercept, and https://github.com/apache/spark/pull/10788/files#diff-c78e117e05337bd8f7151ddf9450047dL402 is just the side effect that we handle standardization better so the convergency is better for the problem with different scale when `standardization = false`.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang closed the pull request at:

    https://github.com/apache/spark/pull/11247


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11247#issuecomment-186040385

@dbtsai Thanks for clarification, I understand the consideration of the current implementation.
I know that if features are not standardized at all, it may result convergency issue when the features have very different scales. This is aligned with setting ```standardization = true``` as default value.
But to some special case(such as the test case which I mentioned above), disable standardization will accelerate training. I think the users should have the ability to control it.
Further more, I think enable or disable ```standardization``` will run the same route may make users confused. Users may look forward to different result or convergency rate(even if it's worse in some cases) for enable or disable ```standardization```, but they got the same one.
The only issue we should concern about is that if this change will break someone's working training. I think most of the working trainings may trained with regularization, it will has little effect. Even if users train without regularization, enable or disable ```standardization``` will produce the same result currently, users have high possibility to use the default setting.
I'm still open to hear your thoughts.

BTW, I send #11258 to fix the bug for ```LogisticRegressionWithLBFGS```.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185577616
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185644984
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51481/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185826603
  
    @yanboliang  In #7080, It was intentionally made that `standardization = false` will run the same route as `standardization = true` without regularization. Mathematically, this can be proven that they will converge into the same solution. 
    
    With your change, it seems that when `standardization = false`, the features are not standardized at all, and this will result convergency issue when the features have very different scales. As a result, in order to avoid this issue, the features will be always standardized no matter what, and then each components will be penalized different to correct this effect.
    
    I wonder if your test with less iterations are special case, since without standardization, some of the data will not be converged when I tested it. This may help some dataset, but I don't know if this will break someone's working training.
    
    BTW, can you submit the bug fix for `LogisticRegression.scala` in a separate PR?
    
    ```scala
           optimizer.getUpdater() match {
             case x: SquaredL2Updater => runWithMlLogisitcRegression(0.0)
             case x: L1Updater => runWithMlLogisitcRegression(1.0)
    ```
    
    Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185577514
  
    **[Test build #51470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51470/consoleFull)** for PR 11247 at commit [`474c78e`](https://github.com/apache/spark/commit/474c78e7b7a054a017e1858ffd3fe86a88c894e7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11247#discussion_r53290393
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -444,8 +444,8 @@ class LogisticRegressionWithLBFGS
             createModel(weights, mlLogisticRegresionModel.intercept)
           }
           optimizer.getUpdater() match {
    -        case x: SquaredL2Updater => runWithMlLogisitcRegression(1.0)
    -        case x: L1Updater => runWithMlLogisitcRegression(0.0)
    +        case x: SquaredL2Updater => runWithMlLogisitcRegression(0.0)
    +        case x: L1Updater => runWithMlLogisitcRegression(1.0)
    --- End diff --
    
    ```elasticNetParam``` == 0.0 means L2 penalty, fix this bug BTW.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185644683
  
    **[Test build #51481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51481/consoleFull)** for PR 11247 at commit [`b973a0e`](https://github.com/apache/spark/commit/b973a0e9ae6e0a48ea875e5c0173760ce245a6d6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185652125
  
    @iyounus  does this relate to your recent changes you've focused on? I might be imagining things.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185632014
  
    **[Test build #51481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51481/consoleFull)** for PR 11247 at commit [`b973a0e`](https://github.com/apache/spark/commit/b973a0e9ae6e0a48ea875e5c0173760ce245a6d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-185644981
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11247#issuecomment-188668664
  
    @dbtsai I think you convinced me, and I have also checked the R glmnet implementation. The current behavior may be more make sense, so I will close this PR. Thanks for your kindly clarification!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11247#discussion_r53290854
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
    @@ -394,6 +394,7 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
         lrA.optimizer.setNumIterations(numIteration)
         val lrB = new LogisticRegressionWithLBFGS().setIntercept(true).setFeatureScaling(false)
         lrB.optimizer.setNumIterations(numIteration)
    +    lrB.optimizer.setRegParam(0.001)
    --- End diff --
    
    If we do not penalize, it will not yield the same result in the scaled space due to poor convergence rate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org