Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2014/08/12 03:11:39 UTC

[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/1897

    [SPARK-2979][MLlib ]Improve the convergence rate by minimize the condition number

    Scaling to minimize the condition number:
    During optimization, the convergence rate depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets whose columns have very different scales may fail to converge.
    The GLMNET and LIBSVM packages perform this scaling to reduce the condition number, and return the weights in the original scale.
    See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    Here, if useFeatureScaling is enabled, we standardize the training features by dividing each column by its standard deviation (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space back to the original scale, as GLMNET and LIBSVM do.
    Currently, it is only enabled in LogisticRegressionWithLBFGS.
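
The scaling step described above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: rows are dense `Array[Double]` values, each column is divided by its sample standard deviation (which is what scaling to unit variance amounts to), and the helper names `columnStd`, `scaleRows`, and `toOriginalScale` are hypothetical, not MLlib APIs.

```scala
object FeatureScalingSketch {
  // Sample standard deviation of each column (n - 1 in the denominator).
  def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
    val n = rows.length
    val dim = rows.head.length
    Array.tabulate(dim) { j =>
      val col = rows.map(_(j))
      val mean = col.sum / n
      math.sqrt(col.map(v => (v - mean) * (v - mean)).sum / (n - 1))
    }
  }

  // Divide each column by its standard deviation; the mean is NOT subtracted.
  // In this sketch, constant columns (std == 0) are mapped to 0.0.
  def scaleRows(rows: Seq[Array[Double]], std: Array[Double]): Seq[Array[Double]] =
    rows.map(r => r.zip(std).map { case (v, s) => if (s != 0.0) v / s else 0.0 })

  // A weight learned on x / std corresponds to weight / std on the original x,
  // because wScaled * (x / std) == (wScaled / std) * x.
  def toOriginalScale(scaledWeights: Array[Double], std: Array[Double]): Array[Double] =
    scaledWeights.zip(std).map { case (w, s) => if (s != 0.0) w / s else 0.0 }
}
```

The last helper is the part that lets the caller keep working in the original feature scale: predictions computed with the back-transformed weights on raw features match those computed with the scaled-space weights on scaled features.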


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1897
    
----
commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7
Author: DB Tsai <db...@alpinenow.com>
Date:   2014-08-08T23:23:21Z

    Improve the convergence rate by minimize the condition number in LOR with LBFGS

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-51871303
  
    QA tests have started for PR 1897. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-51873603
  
    QA results for PR 1897:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1897#discussion_r16099253
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
    @@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
           throw new SparkException("Input validation failed.")
         }
     
    +    /**
    +     * Scaling to minimize the condition number:
    --- End diff --
    
    `minimize the condition number` is not accurate. We can say `scaling columns to unit variance as a heuristic to reduce the condition number`.
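
To illustrate the distinction between reducing and minimizing, here is a small plain-Scala sketch with made-up data and hypothetical helper names. It computes the condition number of the 2x2 Gram matrix X^T X for two columns on very different scales, before and after dividing each column by its sample standard deviation.

```scala
object ConditionNumberSketch {
  // Entries (a, b, c) of the 2x2 Gram matrix [[a, b], [b, c]] = X^T X
  // for a design matrix whose two columns are x1 and x2.
  def gram(x1: Array[Double], x2: Array[Double]): (Double, Double, Double) = {
    val a = x1.map(v => v * v).sum
    val b = x1.zip(x2).map { case (u, v) => u * v }.sum
    val c = x2.map(v => v * v).sum
    (a, b, c)
  }

  // Condition number (largest / smallest eigenvalue) of a symmetric
  // positive definite 2x2 matrix [[a, b], [b, c]].
  def conditionNumber(a: Double, b: Double, c: Double): Double = {
    val half = (a + c) / 2.0
    val radius = math.sqrt((a - c) * (a - c) / 4.0 + b * b)
    val lambdaMax = half + radius
    val lambdaMin = (a * c - b * b) / lambdaMax // det / lambdaMax avoids cancellation
    lambdaMax / lambdaMin
  }

  // Sample standard deviation of one column.
  def std(x: Array[Double]): Double = {
    val mean = x.sum / x.length
    math.sqrt(x.map(v => (v - mean) * (v - mean)).sum / (x.length - 1))
  }

  def main(args: Array[String]): Unit = {
    val x1 = Array(1.0, 2.0, 3.0, 4.0)
    val x2 = Array(1000.0, 3000.0, 2000.0, 4000.0)
    val (a, b, c) = gram(x1, x2)
    val (s1, s2) = (std(x1), std(x2))
    val (a2, b2, c2) = gram(x1.map(_ / s1), x2.map(_ / s2))
    println(f"raw:    ${conditionNumber(a, b, c)}%.1f")
    println(f"scaled: ${conditionNumber(a2, b2, c2)}%.1f")
  }
}
```

On this data, scaling lowers the condition number from roughly 1.5e7 to about 59. It stays well above 1 because the two columns remain correlated, which is why "reduce" is the more accurate word.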



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1897#discussion_r16099170
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
    @@ -185,6 +185,58 @@ class LogisticRegressionSuite extends FunSuite with LocalSparkContext with Match
         // Test prediction on Array.
         validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
       }
    +
    +  test("numerical stability of scaling features using logistic regression with LBFGS") {
    +    /**
    +     * If we rescale the features, the condition number changes, and so does the convergence
    +     * rate; the solution will then not equal the original solution multiplied by the scaling
    +     * factor, as it should.
    +     *
    +     * However, since LogisticRegressionWithLBFGS standardizes the training dataset first,
    +     * no matter what scaling factor we multiply the dataset by, the convergence rate should be
    +     * the same, and the solution should equal the original solution multiplied by the scaling
    +     * factor.
    +     */
    +
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val testData = LogisticRegressionSuite.generateLogisticInput(A, B, nPoints, 42)
    +
    +    val initialWeights = Vectors.dense(0.0)
    +
    +    val testRDD1 = sc.parallelize(testData, 2)
    +
    +    val testRDD2 = sc.parallelize(
    +      testData.map(x => LabeledPoint(x.label, Vectors.fromBreeze(x.features.toBreeze * 1.0E3))), 2)
    +
    +    val testRDD3 = sc.parallelize(
    +      testData.map(x => LabeledPoint(x.label, Vectors.fromBreeze(x.features.toBreeze * 1.0E6))), 2)
    +
    +    testRDD1.cache()
    +    testRDD2.cache()
    +    testRDD3.cache()
    +
    +    val lrA = new LogisticRegressionWithLBFGS().setIntercept(true)
    +    val lrB = new LogisticRegressionWithLBFGS().setIntercept(true).setFeatureScaling(false)
    +
    +    val modelA1 = lrA.run(testRDD1, initialWeights)
    +    val modelA2 = lrA.run(testRDD2, initialWeights)
    +    val modelA3 = lrA.run(testRDD3, initialWeights)
    +
    +    val modelB1 = lrB.run(testRDD1, initialWeights)
    +    val modelB2 = lrB.run(testRDD2, initialWeights)
    +    val modelB3 = lrB.run(testRDD3, initialWeights)
    +
    +    // Test the weights
    +    assert(modelA1.weights(0) ~== modelA2.weights(0) * 1.0E3 absTol 0.01)
    +    assert(modelA1.weights(0) ~== modelA3.weights(0) * 1.0E6 absTol 0.01)
    +
    +    assert(modelB1.weights(0) !~== modelB2.weights(0) * 1.0E3 absTol 0.1)
    --- End diff --
    
    need a comment about the purpose of the tests here
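
One way to state the purpose of these assertions: with feature scaling on, multiplying a feature by a constant k leaves the standardized column unchanged, so the optimizer sees the same problem and the weight in the original scale is divided by k. A minimal plain-Scala sketch of that argument follows. The helper names are hypothetical, the scaled-space weight is a made-up number, and no model is actually trained here.

```scala
object ScaleInvarianceSketch {
  // Sample standard deviation of a single feature column.
  def std(x: Array[Double]): Double = {
    val mean = x.sum / x.length
    math.sqrt(x.map(v => (v - mean) * (v - mean)).sum / (x.length - 1))
  }

  // Divide by the standard deviation without subtracting the mean.
  def standardize(x: Array[Double]): Array[Double] = {
    val s = std(x)
    x.map(_ / s)
  }

  def main(args: Array[String]): Unit = {
    val x = Array(0.5, -1.25, 2.0, 3.75)
    val k = 1.0e3
    val xScaledUp = x.map(_ * k)

    // Both columns standardize to (nearly) the same values.
    val maxDiff = standardize(x).zip(standardize(xScaledUp))
      .map { case (u, v) => math.abs(u - v) }.max
    println(s"max difference after standardization: $maxDiff")

    // Hypothetical weight learned in the scaled space, shared by both runs.
    val wScaled = 0.8
    val w1 = wScaled / std(x)         // weight for the original feature
    val w2 = wScaled / std(xScaledUp) // weight for the feature multiplied by k
    println(s"w1 = $w1, w2 * k = ${w2 * k}")
  }
}
```

Without the standardization step the optimizer would see three differently conditioned problems, so no such relation between the three models' weights is guaranteed.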



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52149162
  
    It seems that Jenkins is not stable; the build is failing on issues related to Akka.



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52226905
  
    LGTM. Merged into both master and branch-1.1. Thanks!!



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52149464
  
    QA tests have started for PR 1897. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52148128
  
    QA results for PR 1897:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1897#discussion_r16099150
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
    @@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
           throw new SparkException("Input validation failed.")
         }
     
    +    /**
    +     * Scaling to minimize the condition number:
    +     *
    +     * During the optimization process, the convergence (rate) depends on the condition number of
    +     * the training dataset. Scaling the variables often reduces this condition number, thus
    +     * improving the convergence rate dramatically. Without reducing the condition number,
    +     * some training datasets mixing the columns with different scales may not be able to converge.
    +     *
    +     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
    +     * the weights in the original scale.
    +     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    +     *
    +     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
    +     * the variance of each column (without subtracting the mean), and train the model in the
    +     * scaled space. Then we transform the coefficients from the scaled space to the original scale
    +     * as GLMNET and LIBSVM do.
    +     *
    +     * Currently, it's only enabled in LogisticRegressionWithLBFGS
    +     */
    +    val scaler = if (useFeatureScaling) {
    +      (new StandardScaler).fit(input.map(x => x.features))
    +    } else {
    +      null
    +    }
    +
         // Prepend an extra variable consisting of all 1.0's for the intercept.
         val data = if (addIntercept) {
    -      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      if(useFeatureScaling) {
    +        input.map(labeledPoint =>
    +          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      }
         } else {
    -      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    +      if (useFeatureScaling) {
    +        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    --- End diff --
    
    should use `input` itself



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-51870344
  
    QA tests have started for PR 1897. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-51865332
  
    QA results for PR 1897:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52145394
  
    Jenkins, test this please.



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52149135
  
    Jenkins, test this please.



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52145716
  
    QA tests have started for PR 1897. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-51872737
  
    QA results for PR 1897:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1897



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-52152780
  
    QA results for PR 1897:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1897#issuecomment-51862223
  
    QA tests have started for PR 1897. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1897#discussion_r16221810
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
    @@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
           throw new SparkException("Input validation failed.")
         }
     
    +    /**
    +     * Scaling to minimize the condition number:
    +     *
    +     * During the optimization process, the convergence (rate) depends on the condition number of
    +     * the training dataset. Scaling the variables often reduces this condition number, thus
    +     * improving the convergence rate dramatically. Without reducing the condition number,
    +     * some training datasets mixing the columns with different scales may not be able to converge.
    +     *
    +     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
    +     * the weights in the original scale.
    +     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    +     *
    +     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
    +     * the variance of each column (without subtracting the mean), and train the model in the
    +     * scaled space. Then we transform the coefficients from the scaled space to the original scale
    +     * as GLMNET and LIBSVM do.
    +     *
    +     * Currently, it's only enabled in LogisticRegressionWithLBFGS
    +     */
    +    val scaler = if (useFeatureScaling) {
    +      (new StandardScaler).fit(input.map(x => x.features))
    +    } else {
    +      null
    +    }
    +
         // Prepend an extra variable consisting of all 1.0's for the intercept.
         val data = if (addIntercept) {
    -      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      if(useFeatureScaling) {
    +        input.map(labeledPoint =>
    +          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      }
         } else {
    -      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    +      if (useFeatureScaling) {
    +        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    --- End diff --
    
    Sorry, I didn't realize that.



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1897#discussion_r16153527
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
    @@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
           throw new SparkException("Input validation failed.")
         }
     
    +    /**
    +     * Scaling to minimize the condition number:
    +     *
    +     * During the optimization process, the convergence (rate) depends on the condition number of
    +     * the training dataset. Scaling the variables often reduces this condition number, thus
    +     * improving the convergence rate dramatically. Without reducing the condition number,
    +     * some training datasets mixing the columns with different scales may not be able to converge.
    +     *
    +     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
    +     * the weights in the original scale.
    +     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    +     *
    +     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
    +     * the variance of each column (without subtracting the mean), and train the model in the
    +     * scaled space. Then we transform the coefficients from the scaled space to the original scale
    +     * as GLMNET and LIBSVM do.
    +     *
    +     * Currently, it's only enabled in LogisticRegressionWithLBFGS
    +     */
    +    val scaler = if (useFeatureScaling) {
    +      (new StandardScaler).fit(input.map(x => x.features))
    +    } else {
    +      null
    +    }
    +
         // Prepend an extra variable consisting of all 1.0's for the intercept.
         val data = if (addIntercept) {
    -      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      if(useFeatureScaling) {
    +        input.map(labeledPoint =>
    +          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      }
         } else {
    -      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    +      if (useFeatureScaling) {
    +        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    --- End diff --
    
    It's not an identity map. It converts each labeledPoint into a tuple of response and feature vector for the optimizer.
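
The type difference can be seen in a small plain-Scala sketch. `LabeledPoint` here is a simplified stand-in case class (not the MLlib class), `Seq` stands in for `RDD`, and `toOptimizerInput` is a hypothetical name.

```scala
object TupleConversionSketch {
  // Simplified stand-in for MLlib's LabeledPoint; features are a plain array here.
  final case class LabeledPoint(label: Double, features: Array[Double])

  // Looks like an identity map, but the element type changes from
  // LabeledPoint to (Double, Array[Double]): a (response, features) pair.
  def toOptimizerInput(input: Seq[LabeledPoint]): Seq[(Double, Array[Double])] =
    input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))

  def main(args: Array[String]): Unit = {
    val input = Seq(LabeledPoint(1.0, Array(0.5, 2.0)), LabeledPoint(0.0, Array(1.5, -1.0)))
    val data = toOptimizerInput(input)
    // val same: Seq[(Double, Array[Double])] = input  // would not compile
    data.foreach { case (label, features) => println(s"$label -> ${features.mkString(", ")}") }
  }
}
```

Because the element types differ, passing `input` through unchanged would not type-check in the branch that feeds the optimizer.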

