You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2016/01/16 18:10:31 UTC

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/10788

    [SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized

    The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept which resulting poor training accuracy with regularization.
    The new implementation in ML framework handles this properly, and we should call the implementation in ML from MLlib since majority of users are still using MLlib api.
    Note that both of them are doing feature scalings to improve the convergence, and the only difference is ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
    
    Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for @dbtsai to review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10788.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10788
    
----
commit a529c013fa722748cbd1d3878e4ea3bed5b15181
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-22T20:54:59Z

    document plans

commit f9e26350d15d7d36b75ece4f4718797dbe2a0830
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-22T22:53:29Z

    Some progress.

commit 7ebbd566e20923efc32dee1cfcf12ea315259e30
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-22T23:16:18Z

    Keep track of the number of requested classes so that if its more than 2 we use the legacy implementation. Also allow pass through of initialWeights

commit ef2a9b0f5b6cb2e971c2e5371f3394b4dec64574
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-22T23:48:06Z

    Expose a train on instances method within Spark, use numOfLinearPredictors instead of keeping track of class variable, pass through persistence information

commit 407491e38b1a5834d26a137ab20829a3d96f5142
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-24T01:14:04Z

    tests are fun

commit e02bf3a9688d1efa2f3da60b3d9f27911b04955d
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-24T07:42:13Z

    Start updating the tests to run with different updaters.

commit 8517539d0e8829833968dcb7e47ad8ba20849cb1
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-24T08:00:36Z

    get the tests compiling

commit a619d42b821575afd8efa90f2a38edf9690eb0df
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-24T08:04:53Z

    style fixed

commit 4febcc32f524edadeb68dc674e2681a087ffaa38
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-24T08:13:23Z

    make the test method private

commit e8e03a13ba04c6b3100e290a5c435959c2f01912
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-24T20:16:13Z

    CR feedback, pass RDD of Labeled points to ml implemetnation. Also from tests require that feature scaling is turned on to use ml implementation.

commit 38a024bd9a36e83ef8005a5f2af8a4dd44f6760e
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-25T07:24:21Z

    Convert it to a df and use set for the inital params

commit 478b8c5d5ff20478dc4ba913b0c77172e0abdfff
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-25T20:06:57Z

    Handle non-dense weights

commit 08589f58b81bc1e6099b425f86226053c5b6a360
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-26T03:39:54Z

    CR feedback: make the setInitialWeights function private, don't mess with the weights when they are user supploed, validate that the user supplied weights are reasonable.

commit f40c401496ae1e6cc7b39db820fea194d42c25c5
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-05-26T04:19:46Z

    style fix up

commit f35a16aa8110a33c32959db674908d145be6e97f
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-06-02T23:29:11Z

    Copy the number of iterations, convergence tolerance, and if we are fitting an intercept from mllib to ml when training lbfgs model using ml code

commit 4d431a358074f5245abcbc95af3e2bdf75b4f21d
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-06-03T00:39:48Z

    scala style check issue

commit 7e4192849efc6d282633159a15c7dd41376aa1a3
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-06-03T07:30:48Z

    Only the weights if we need to.

commit ed351ffdf862994389b41284f95aa148c6550f41
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-06-03T19:39:56Z

    Use appendBias for adding intercept to initial weights , fix generateInitialWeights

commit 3ac02d72cab72b35b7cc76c50d7088d4b98bfd9d
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-06-08T20:20:19Z

    Merge in master

commit d1ce12ba45f12d93b962ffd560242757eda739c2
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-07-09T20:13:21Z

    Merge in master

commit 8ca0fa927bd2773ceb4ccf740445058ead706f7a
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-08-28T21:57:51Z

    attempt to merge in master

commit 6f66f2cbc7d80335bfb0e2e5b8b430930206d06f
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-10-01T23:05:01Z

    Merge in master (again)

commit 0cedd50368eeda594eafdb9500ed162ff33f2e25
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2015-10-02T01:44:08Z

    Fix compile error after simple merge

commit 2bf289b2ab92ff9da742d22e1feda0b57f8a796c
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-30T18:41:30Z

    Merge branch 'master' into SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized

commit d7a26318be962eede7d6fa0792f1f4d72178dc8d
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-01-16T03:21:04Z

    Merge in master

commit b0fe1e68bf8e7fc13cc845db90e7eb27729545d9
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-01-16T03:24:08Z

    scala style import order fix

commit 827dcdec09414c5b25a66be359c4d651a9e18ee6
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-01-16T06:24:33Z

    Import ordering

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369141
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    --- End diff --
    
    `vec` is not used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-173519318
  
    **[Test build #49871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49871/consoleFull)** for PR 10788 at commit [`46ae406`](https://github.com/apache/spark/commit/46ae406e7d9935ba2d75a092e98622578fb4ce15).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961836
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    --- End diff --
    
    Here will be
    
    ```scala
    lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(uid, initialWeights, 1))
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172968528
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963160
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept)
    --- End diff --
    
    Ah then there is no validation step we just assume if they set the initial model they set a valid initial model. Ok :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172468868
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962512
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    --- End diff --
    
    Replace `algorithm` by `Logistic Regression`, and add a new line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172246749
  
    **[Test build #49535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49535/consoleFull)** for PR 10788 at commit [`ba99ce9`](https://github.com/apache/spark/commit/ba99ce990f0004593908d7da05bdd9b0d0cb4c4f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961912
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialCoefficients = Some(value)
    +    this
    +  }
    +
    +  /**
    +   * Validate the initial weights, return an Option, if not the expected size return None
    +   * and log a warning.
    +   */
    +  private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = {
    --- End diff --
    
    validateCoefficients


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172231480
  
    **[Test build #49532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49532/consoleFull)** for PR 10788 at commit [`827dcde`](https://github.com/apache/spark/commit/827dcdec09414c5b25a66be359c4d651a9e18ee6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369552
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
    +          /**
    +           * For binary logistic regression, when we initialize the coefficients as zeros,
    +           * it will converge faster if we initialize the intercept such that
    +           * it follows the distribution of the labels.
    +
    --- End diff --
    
    remove the extra line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962879
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    --- End diff --
    
    I'm assuming you meant `run(input, input, userSuppliedWeights = true)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172973109
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172968912
  
    **[Test build #49697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49697/consoleFull)** for PR 10788 at commit [`7501b4b`](https://github.com/apache/spark/commit/7501b4b29d0d08d1363cb1f16be1397887a569b1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50370414
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, userSuppliedWeights = true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        lr.setStandardization(useFeatureScaling)
    +        if (userSuppliedWeights) {
    +          val uid = Identifiable.randomUID("logreg-static")
    +          lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
    +            uid, initialWeights, 1.0))
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    +        // Train our model
    +        val mlLogisticRegresionModel = lr.train(df)
    +        // unpersist if we persisted
    +        if (handlePersistence) {
    +          df.unpersist()
    +        }
    +        // convert the model
    +        val weights = mlLogisticRegresionModel.weights match {
    +          case x: DenseVector => x
    +          case y: Vector => Vectors.dense(y.toArray)
    +        }
    +        createModel(weights, mlLogisticRegresionModel.intercept)
    +      }
    +      optimizer.getUpdater() match {
    --- End diff --
    
    when `optimizer.getRegParam() == 0.0`, run the old version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172468869
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49586/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962416
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    --- End diff --
    
    `run(input, generateInitialWeights(input), userSuppliedWeights = false)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-175380873
  
    **[Test build #50158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50158/consoleFull)** for PR 10788 at commit [`8016ad8`](https://github.com/apache/spark/commit/8016ad814a359e2e8d300c84b52a1a021f13b9dc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369730
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -374,11 +395,11 @@ class LogisticRegression @Since("1.2.0") (
               throw new SparkException(msg)
             }
     
    -        /*
    -           The coefficients are trained in the scaled space; we're converting them back to
    -           the original space.
    -           Note that the intercept in scaled space and original space is the same;
    -           as a result, no scaling is needed.
    +        /**
    +         * The coefficients are trained in the scaled space; we're converting them back to
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962541
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    --- End diff --
    
    You can remove `useFeatureScaling`, and pass it as setStandardization in ML implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172459633
  
    **[Test build #49586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49586/consoleFull)** for PR 10788 at commit [`43a3a32`](https://github.com/apache/spark/commit/43a3a3246f793d467751f40b4dceba6ccaed394b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369516
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
    +          /**
    +           * For binary logistic regression, when we initialize the coefficients as zeros,
    +           * it will converge faster if we initialize the intercept such that
    +           * it follows the distribution of the labels.
    +
    +           * {{{
    +           * P(0) = 1 / (1 + \exp(b)), and
    +           * P(1) = \exp(b) / (1 + \exp(b))
    +           * }}}, hence
    +           * {{{
    +           * b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    +           * }}}
                */
    -          initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
    -            histogram(1) / histogram(0))
    +          initialCoefficientsWithIntercept.toArray(numFeatures)
    +          = math.log(histogram(1) / histogram(0))
    --- End diff --
    
    add two spaces.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369207
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172434712
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-175341628
  
    Thanks. Merged into master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174682157
  
    **[Test build #50011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50011/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-173493541
  
    LGTM except some styling issues, and concern about caching twice. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931598
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
               /*
                  For binary logistic regression, when we initialize the coefficients as zeros,
                  it will converge faster if we initialize the intercept such that
                  it follows the distribution of the labels.
     
                  {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    +             P(0) = 1 / (1 + \exp(b)), and
    +             P(1) = \exp(b) / (1 + \exp(b))
                  }}}, hence
                  {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    +             b = \log{P(1) / P(0)} = \log{count_1 / count_0}
                  }}}
                */
    -          initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
    -            histogram(1) / histogram(0))
    +          initialCoefficientsWithIntercept.toArray(numFeatures)
    +            = math.log(histogram(1) / histogram(0))
    --- End diff --
    
    revert this change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172233210
  
    Also related is https://github.com/apache/spark/pull/6771 which avoids the round trip through DataFrames (although since it now works on Instances rather than LabeledPoints under the hood it got a bit more complicated) so its probably OK to have the round trip (but just to keep as an option).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172430774
  
    **[Test build #49570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49570/consoleFull)** for PR 10788 at commit [`0e2ea49`](https://github.com/apache/spark/commit/0e2ea495ad0020f89df9e70653ff380673d3563e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172442232
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49571/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931610
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
               /*
                  For binary logistic regression, when we initialize the coefficients as zeros,
                  it will converge faster if we initialize the intercept such that
                  it follows the distribution of the labels.
     
                  {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    +             P(0) = 1 / (1 + \exp(b)), and
    +             P(1) = \exp(b) / (1 + \exp(b))
                  }}}, hence
                  {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    +             b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    --- End diff --
    
    put two spaces back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369078
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    --- End diff --
    
    How can this compile? Should be `optInitialModel.get.coefficients.size != numFeatures`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962964
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    --- End diff --
    
    haha... yes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-175381442
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50158/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963357
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    --- End diff --
    
    that makes sense. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963463
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +
    +  /** @group setParam */
    +  private[spark] def setInitialModel(model: LogisticRegressionModel): this.type = {
    +    this.optInitialCoefficients = Some(model.coefficients)
    --- End diff --
    
    You don't want to lose the information of intercept in model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369722
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
    +          /**
    +           * For binary logistic regression, when we initialize the coefficients as zeros,
    +           * it will converge faster if we initialize the intercept such that
    +           * it follows the distribution of the labels.
    +
    +           * {{{
    +           * P(0) = 1 / (1 + \exp(b)), and
    +           * P(1) = \exp(b) / (1 + \exp(b))
    +           * }}}, hence
    +           * {{{
    +           * b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    +           * }}}
                */
    -          initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
    -            histogram(1) / histogram(0))
    +          initialCoefficientsWithIntercept.toArray(numFeatures)
    +          = math.log(histogram(1) / histogram(0))
             }
     
             val states = optimizer.iterations(new CachedDiffFunction(costFun),
               initialCoefficientsWithIntercept.toBreeze.toDenseVector)
     
    -        /*
    -           Note that in Logistic Regression, the objective history (loss + regularization)
    -           is log-likelihood which is invariance under feature standardization. As a result,
    -           the objective history from optimizer is the same as the one in the original space.
    +        /**
    +         * Note that in Logistic Regression, the objective history (loss + regularization)
    --- End diff --
    
    reverse the style change


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172973112
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49692/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50372852
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, userSuppliedWeights = true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        lr.setStandardization(useFeatureScaling)
    +        if (userSuppliedWeights) {
    +          val uid = Identifiable.randomUID("logreg-static")
    +          lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
    +            uid, initialWeights, 1.0))
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    --- End diff --
    
    Good point, in a previous version of the code we passed handlePersistence down through to avoid this. I've updated it to do the same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172244017
  
    **[Test build #49535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49535/consoleFull)** for PR 10788 at commit [`ba99ce9`](https://github.com/apache/spark/commit/ba99ce990f0004593908d7da05bdd9b0d0cb4c4f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50372566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
    +          /**
    +           * For binary logistic regression, when we initialize the coefficients as zeros,
    +           * it will converge faster if we initialize the intercept such that
    +           * it follows the distribution of the labels.
    +
    --- End diff --
    
    Ok, looking at the rest of the comments in the file & the style guide it seems to mostly have the `*` but I'll put them back in (it also break auto indent to not have them but thats an emacs bug)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962157
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept)
    --- End diff --
    
    let's handle it through setInitialModel, and have another PR to make it public. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962408
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    --- End diff --
    
    Replace `algorithm` by `Logistic Regression`, and remove `starting from the initial weights provided`.
    
    Add a new line between `of LabeledPoint entries` and `If a known updater is used`.
    
    Actually, in ml version, disabling feature scaling is supported now. So please call ml implementation in this case. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172979039
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49697/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49964184
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val initialCoefficientsWithIntercept = optInitialCoefficients.getOrElse(
    +      Vectors.zeros(numFeaturesWithIntercept))
     
    --- End diff --
    
    btw, may we want to log. `if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures)`, let's log it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172237941
  
    **[Test build #49532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49532/consoleFull)** for PR 10788 at commit [`827dcde`](https://github.com/apache/spark/commit/827dcdec09414c5b25a66be359c4d651a9e18ee6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172434715
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49570/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172246795
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962944
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept)
    --- End diff --
    
    You will know # of features by the size of coefficients set by setInitialModel. There is no ambiguity here since it's binary, and intercept has a separate variable. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174690199
  
    **[Test build #50024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50024/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172427338
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172468731
  
    **[Test build #49586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49586/consoleFull)** for PR 10788 at commit [`43a3a32`](https://github.com/apache/spark/commit/43a3a3246f793d467751f40b4dceba6ccaed394b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50370273
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, userSuppliedWeights = true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        lr.setStandardization(useFeatureScaling)
    +        if (userSuppliedWeights) {
    +          val uid = Identifiable.randomUID("logreg-static")
    +          lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
    +            uid, initialWeights, 1.0))
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    +        // Train our model
    +        val mlLogisticRegresionModel = lr.train(df)
    +        // unpersist if we persisted
    +        if (handlePersistence) {
    +          df.unpersist()
    +        }
    +        // convert the model
    +        val weights = mlLogisticRegresionModel.weights match {
    --- End diff --
    
    ```scala
    val weights = Vectors.dense(mlLogisticRegresionModel.coefficients.toArray)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-173519471
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931620
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
               /*
                  For binary logistic regression, when we initialize the coefficients as zeros,
                  it will converge faster if we initialize the intercept such that
                  it follows the distribution of the labels.
     
                  {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    +             P(0) = 1 / (1 + \exp(b)), and
    --- End diff --
    
    put two spaces back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172442231
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174661379
  
    @dbtsai should have addressed the style concerns, let me know if anything else shows up :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50372397
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    --- End diff --
    
    its used on L348 in the log warning


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961964
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialCoefficients = Some(value)
    +    this
    +  }
    +
    +  /**
    +   * Validate the initial weights, return an Option, if not the expected size return None
    +   * and log a warning.
    +   */
    +  private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = {
    +    vectorOpt.flatMap(vec =>
    +      if (vec.size == numFeatures) {
    +        Some(vec)
    +      } else {
    +        logWarning(
    +          s"""Initial weights provided (${vec})did not match the expected size ${numFeatures}""")
    +        None
    +      })
    +  }
    +
    +  override protected[spark] def train(dataset: DataFrame): LogisticRegressionModel = {
         val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
    +    val instances = dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
    --- End diff --
    
    why this line is changed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172957321
  
    **[Test build #49692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49692/consoleFull)** for PR 10788 at commit [`e1b0389`](https://github.com/apache/spark/commit/e1b038926b7506cfa240883ae177785a24cc9870).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172237970
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174688059
  
    Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172421474
  
    **[Test build #49568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49568/consoleFull)** for PR 10788 at commit [`4caab8c`](https://github.com/apache/spark/commit/4caab8ca2ac23f24fe84cf741ab2c013e319752d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174714408
  
    **[Test build #50024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50024/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172427339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49568/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041320
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    --- End diff --
    
    Removed `starting from the initial weights provided.` and add extra new line here for readability. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172434615
  
    **[Test build #49570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49570/consoleFull)** for PR 10788 at commit [`0e2ea49`](https://github.com/apache/spark/commit/0e2ea495ad0020f89df9e70653ff380673d3563e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962722
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    +        // Train our model
    +        val mlLogisticRegresionModel = lr.train(df)
    +        // unpersist if we persisted
    +        if (handlePersistence) {
    +          df.unpersist()
    +        }
    --- End diff --
    
    ditto?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174682295
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50011/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174682293
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-175341287
  
    **[Test build #50158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50158/consoleFull)** for PR 10788 at commit [`8016ad8`](https://github.com/apache/spark/commit/8016ad814a359e2e8d300c84b52a1a021f13b9dc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174715017
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50024/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172972956
  
    **[Test build #49692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49692/consoleFull)** for PR 10788 at commit [`e1b0389`](https://github.com/apache/spark/commit/e1b038926b7506cfa240883ae177785a24cc9870).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50370169
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, userSuppliedWeights = true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        lr.setStandardization(useFeatureScaling)
    +        if (userSuppliedWeights) {
    +          val uid = Identifiable.randomUID("logreg-static")
    +          lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
    +            uid, initialWeights, 1.0))
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    --- End diff --
    
    Will this cause double caching? Let's say input RDD is cached, so `handlePersistence` will be false. As a result, `df == StorageLevel.NONE` will be true in ml's LOR code, and this will cause caching twice. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962990
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    +        // Train our model
    +        val mlLogisticRegresionModel = lr.train(df)
    +        // unpersist if we persisted
    +        if (handlePersistence) {
    +          df.unpersist()
    +        }
    --- End diff --
    
    same


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962716
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    --- End diff --
    
    Why do we need to do it? I through those check is already in ML code. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50369668
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    -          /*
    -             For binary logistic regression, when we initialize the coefficients as zeros,
    -             it will converge faster if we initialize the intercept such that
    -             it follows the distribution of the labels.
    -
    -             {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    -             }}}, hence
    -             {{{
    -               b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    -             }}}
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
    +          /**
    +           * For binary logistic regression, when we initialize the coefficients as zeros,
    +           * it will converge faster if we initialize the intercept such that
    +           * it follows the distribution of the labels.
    +
    --- End diff --
    
    I think u have to remove all the `*`. I think we decide to do comment like
    
    ```
    /*
        Start the sentence.
     */


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10788


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961714
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialWeights: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialWeights = Some(value)
    +    this
    +  }
    --- End diff --
    
    How about we follow https://github.com/apache/spark/pull/8972 , and have the following code. We can create another seprate JIRA for moving `setInitialModel` to public with a sharedParam.
    
    ```scala
      private var initialModel: Option[LogisticRegressionModel] = None
    
      private def setInitialModel(model: LogisticRegressionModel): this.type = {
        ...
        ...
        this
      }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172978875
  
    **[Test build #49697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49697/consoleFull)** for PR 10788 at commit [`7501b4b`](https://github.com/apache/spark/commit/7501b4b29d0d08d1363cb1f16be1397887a569b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172979036
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174662125
  
    I'm going through the caching logic now. Will let you know soon. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50040773
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +329,25 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    --- End diff --
    
    Is this used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963912
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val initialCoefficientsWithIntercept = optInitialCoefficients.getOrElse(
    +      Vectors.zeros(numFeaturesWithIntercept))
     
    --- End diff --
    
    here, 
    
    ```scala
    val initialCoefficientsWithIntercept =	
       Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    
    if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
      val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
      optInitialModel.get.coefficients.foreachActive { case (index, value) =>
        initialCoefficientsWithInterceptArray(index) = value
      }
      if ($(fitIntercept) {
              initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
      } 
    } else if ($(fitIntercept)) {
          /*
             For binary logistic regression, when we initialize the coefficients as zeros,
             it will converge faster if we initialize the intercept such that
             it follows the distribution of the labels.
             {{{
             P(0) = 1 / (1 + \exp(b)), and
             P(1) = \exp(b) / (1 + \exp(b))
             }}}, hence
             {{{
             b = \log{P(1) / P(0)} = \log{count_1 / count_0}
             }}}
           */
          initialCoefficientsWithIntercept.toArray(numFeatures)
            = math.log(histogram(1) / histogram(0))
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172431216
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172442212
  
    **[Test build #49571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49571/consoleFull)** for PR 10788 at commit [`699997f`](https://github.com/apache/spark/commit/699997f2b9d22ddb0e8c8391d5c744b8895e91e4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172968530
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49696/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-175381440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-173519472
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49871/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-173506673
  
    **[Test build #49871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49871/consoleFull)** for PR 10788 at commit [`46ae406`](https://github.com/apache/spark/commit/46ae406e7d9935ba2d75a092e98622578fb4ce15).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172246796
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49535/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172431217
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49569/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961903
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialCoefficients = Some(value)
    +    this
    +  }
    +
    +  /**
    +   * Validate the initial weights, return an Option, if not the expected size return None
    +   * and log a warning.
    +   */
    +  private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = {
    +    vectorOpt.flatMap(vec =>
    +      if (vec.size == numFeatures) {
    +        Some(vec)
    +      } else {
    +        logWarning(
    +          s"""Initial weights provided (${vec})did not match the expected size ${numFeatures}""")
    --- End diff --
    
    btw, why `s"""`, also change `weights` to coefficients


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49959401
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialWeights: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialWeights = Some(value)
    +    this
    +  }
    +
    +  /** Validate the initial weights, return an Option, if not the expected size return None
    --- End diff --
    
    Move `Validate .... into a new line`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174665118
  
    **[Test build #50011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50011/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931631
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
             val initialCoefficientsWithIntercept =
               Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
     
    -        if ($(fitIntercept)) {
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    +          val vec = optInitialModel.get.coefficients
    +          logWarning(
    +            s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +        }
    +
    +        if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    +          val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    +          optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +            initialCoefficientsWithInterceptArray(index) = value
    +          }
    +          if ($(fitIntercept)) {
    +            initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    +          }
    +        } else if ($(fitIntercept)) {
               /*
                  For binary logistic regression, when we initialize the coefficients as zeros,
                  it will converge faster if we initialize the intercept such that
                  it follows the distribution of the labels.
     
                  {{{
    -               P(0) = 1 / (1 + \exp(b)), and
    -               P(1) = \exp(b) / (1 + \exp(b))
    +             P(0) = 1 / (1 + \exp(b)), and
    +             P(1) = \exp(b) / (1 + \exp(b))
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961906
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialWeights: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialWeights = Some(value)
    +    this
    +  }
    --- End diff --
    
    So we have setInitialWeights on StreamingLogisticRegressionWithSGD - would it be better to have it match StreamingLogisticRegressionWithSGD ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962987
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    --- End diff --
    
    So the ML code checks on the DataFrame - which will never be cached. So we check on the user supplied input and if the user supplied input is not persisted we handle our own persistance but if the user supplied input is persisted then we don't.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172437627
  
    **[Test build #49571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49571/consoleFull)** for PR 10788 at commit [`699997f`](https://github.com/apache/spark/commit/699997f2b9d22ddb0e8c8391d5c744b8895e91e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172427291
  
    **[Test build #49568 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49568/consoleFull)** for PR 10788 at commit [`4caab8c`](https://github.com/apache/spark/commit/4caab8ca2ac23f24fe84cf741ab2c013e319752d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962737
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
           new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }
     
    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept)
    --- End diff --
    
    I don't think we know the number of features at that point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49959559
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialWeights: Option[Vector] = None
    --- End diff --
    
    We started to change our terminology from `weights` into `coefficients` after the implementation of weighted LoR. Please change it into `optInitialCoefficients`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50371017
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   *
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, userSuppliedWeights = true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        lr.setStandardization(useFeatureScaling)
    +        if (userSuppliedWeights) {
    +          val uid = Identifiable.randomUID("logreg-static")
    +          lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
    +            uid, initialWeights, 1.0))
    +        }
    +        lr.setFitIntercept(addIntercept)
    +        lr.setMaxIter(optimizer.getNumIterations())
    +        lr.setTol(optimizer.getConvergenceTol())
    +        // Convert our input into a DataFrame
    +        val sqlContext = new SQLContext(input.context)
    +        import sqlContext.implicits._
    +        val df = input.toDF()
    +        // Determine if we should cache the DF
    +        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    +        if (handlePersistence) {
    +          df.persist(StorageLevel.MEMORY_AND_DISK)
    +        }
    +        // Train our model
    +        val mlLogisticRegresionModel = lr.train(df)
    +        // unpersist if we persisted
    +        if (handlePersistence) {
    +          df.unpersist()
    +        }
    +        // convert the model
    +        val weights = mlLogisticRegresionModel.weights match {
    +          case x: DenseVector => x
    +          case y: Vector => Vectors.dense(y.toArray)
    +        }
    +        createModel(weights, mlLogisticRegresionModel.intercept)
    +      }
    +      optimizer.getUpdater() match {
    --- End diff --
    
    okay, this will make the test harder to write. I don't care this one now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041152
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,8 +365,8 @@ class LogisticRegression @Since("1.2.0") (
             = math.log(histogram(1) / histogram(0))
         }
     
    -    val states = optimizer.iterations(new CachedDiffFunction(costFun),
    -      initialCoefficientsWithIntercept.toBreeze.toDenseVector)
    +      val states = optimizer.iterations(new CachedDiffFunction(costFun),
    +        initialCoefficientsWithIntercept.toBreeze.toDenseVector)
    --- End diff --
    
    Wrong indentation. Remove two spaces.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041364
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    --- End diff --
    
    Add extra new line before `If a known updater is...`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-172237973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49532/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041495
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), userSuppliedWeights = false)
    +  }
    +
    +  /**
    +   * Run Logistic Regression with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, userSuppliedWeights = true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        lr.setStandardization(useFeatureScaling)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    --- End diff --
    
    This is not used anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962448
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    --- End diff --
    
    `run(input, generateInitialWeights(input), userSuppliedWeights = true)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963455
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds
     
    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    --- End diff --
    
    Keep the reference to `InitialModel` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10788#issuecomment-174715010
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org