Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2015/09/03 18:53:26 UTC

[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/8588

    [WIP][SPARK-9834][MLLIB] implement weighted least squares via normal equation

    This is a WIP. Please do not spend time on documentation or code style.
    
    The goal is to have a weighted least squares implementation that can later provide R-like summary statistics and support IRLS.
    
    I tried to match the results from glmnet, but I found it quite hard to figure out how it handles regularization.
    
    @dbtsai 
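
As a rough illustration of what "weighted least squares via normal equation" means, here is a minimal Scala sketch restricted to a single feature with an intercept and no standardization. `Obs` and `WlsSketch` are hypothetical names made up for this note; they are not part of the patch, which works on feature vectors and solves the system with LAPACK.

```scala
// Hypothetical stand-in for a weighted observation (weight, one feature, label).
final case class Obs(w: Double, a: Double, b: Double)

object WlsSketch {
  /**
   * Weighted ridge regression with one feature and an intercept.
   * Returns (coefficient, intercept). Assumes no standardization.
   */
  def fit(data: Seq[Obs], regParam: Double): (Double, Double) = {
    require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    val wSum = data.map(_.w).sum
    require(wSum > 0.0, "Sum of weights must be positive.")

    // Weighted means of a, b, a*a and a*b.
    val aBar = data.map(o => o.w * o.a).sum / wSum
    val bBar = data.map(o => o.w * o.b).sum / wSum
    val aaBar = data.map(o => o.w * o.a * o.a).sum / wSum
    val abBar = data.map(o => o.w * o.a * o.b).sum / wSum

    // Center the statistics, add the ridge term, and solve the 1x1 system.
    // Note: the denominator is zero if regParam is 0.0 and all features are equal.
    val x = (abBar - aBar * bBar) / (aaBar - aBar * aBar + regParam)

    // Recover the intercept from the weighted means.
    (x, bBar - aBar * x)
  }
}
```

For points lying exactly on b = 2a + 1, `WlsSketch.fit(data, 0.0)` should return a coefficient of about 2.0 and an intercept of about 1.0, up to floating-point error, whatever positive weights are used.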

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-9834

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8588.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8588
    
----
commit 34107aa9654d9a7418fc5db9a67de556efde42c1
Author: Xiangrui Meng <me...@databricks.com>
Date:   2015-09-03T07:24:43Z

    implement weighted least squares via normal equation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959015
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    --- End diff --
    
    This might be used by other algorithms, such as a log-linear model or Lp regression. We can discuss this later.




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137525352
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137650245
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138673301
  
    LGTM; I did not check the low-level implementation.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138657110
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138673406
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8588




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137538143
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41979/




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38775271
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w,,i,,
    + *   + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^,
    + *
    + * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by
    + * [[standardizeLabel]] and [[standardizeFeatures]], respectively.
    + *
    + * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to
    + * match R's `lm`.
    + * Turn on [[standardizeLabel]] to match R's `glmnet`.
    + *
    + * @param fitIntercept whether to fit intercept. If false, z is 0.0.
    + * @param regParam L2 regularization parameter (lambda)
    + * @param standardizeFeatures whether to standardize features. If true, sigma,,j,, is the
    + *                            population standard deviation of the j-th column of A. Otherwise,
    + *                            sigma,,j,, is 1.0.
    + * @param standardizeLabel whether to standardize label. If true, delta is the population standard
    + *                         deviation of the label column b. Otherwise, delta is 1.0.
    + */
    +private[ml] class WeightedLeastSquares(
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val standardizeFeatures: Boolean,
    +    val standardizeLabel: Boolean) extends Logging with Serializable {
    +  import WeightedLeastSquares._
    +
    +  require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    +  if (regParam == 0.0) {
    +    logWarning("regParam is zero, which might cause numerical instability and overfit.")
    +  }
    +
    +  /**
    +   * Creates a [[WeightedLeastSquaresModel]] from an RDD of [[Instance]]s.
    +   */
    +  def fit(instances: RDD[Instance]): WeightedLeastSquaresModel = {
    +    val summary = instances.treeAggregate(new Aggregator)(_.add(_), _.merge(_))
    +    summary.validate()
    +    logInfo(s"Number of instances: ${summary.count}.")
    +    val triK = summary.triK
    +    val bBar = summary.bBar
    +    val bStd = summary.bStd
    +    val aBar = summary.aBar
    +    val aVar = summary.aVar
    +    val abBar = summary.abBar
    +    val aaBar = summary.aaBar
    +    val aaValues = aaBar.values
    +
    +    if (fitIntercept) {
    +      // shift centers
    +      // A^T A - aBar aBar^T
    +      RowMatrix.dspr(-1.0, aBar, aaValues)
    +      // A^T b - bBar aBar
    +      BLAS.axpy(-bBar, aBar, abBar)
    +    }
    +
    +    // add regularization to diagonals
    +    var i = 0
    +    var j = 2
    +    while (i < triK) {
    +      var lambda = regParam
    +      if (standardizeFeatures) {
    +        lambda *= aVar(j - 2)
    +      }
    +      if (standardizeLabel) {
    +        // TODO: handle the case when bStd = 0
    +        lambda /= bStd
    +      }
    +      aaValues(i) += lambda
    +      i += j
    +      j += 1
    +    }
    +
    +    val x = choleskySolve(aaBar.values, abBar)
    +
    +    // compute intercept
    +    val intercept = if (fitIntercept) {
    +      bBar - BLAS.dot(aBar, x)
    +    } else {
    +      0.0
    +    }
    +
    +    new WeightedLeastSquaresModel(x, intercept)
    +  }
    +
    +  /**
    +   * Solves a symmetric positive definite linear system via Cholesky factorization.
    +   * The input arguments are modified in-place to store the factorization and the solution.
    +   * @param A the upper triangular part of A
    +   * @param bx right-hand side
    +   * @return the solution vector
    +   */
    +  private def choleskySolve(A: Array[Double], bx: DenseVector): DenseVector = {
    +    val k = bx.size
    +    val info = new intW(0)
    +    lapack.dppsv("U", k, 1, A, bx.values, k, info)
    +    val code = info.`val`
    +    assert(code == 0, s"lapack.dppsv returned $code.")
    +    bx
    +  }
    +}
    +
    +private[ml] object WeightedLeastSquares {
    +
    +  /**
    +   * Case class for weighted observations.
    +   * @param w weight, must be nonnegative
    +   * @param a features
    +   * @param b label
    +   */
    +  case class Instance(w: Double, a: Vector, b: Double) {
    +    require(w >= 0.0, s"Weight cannot be negative: $w.")
    +  }
    +
    +  /**
    +   * Aggregator to provide necessary summary statistics for solving [[WeightedLeastSquares]].
    +   */
    +  // TODO: consolidate aggregates for summary statistics
    +  private class Aggregator extends Serializable {
    +    var initialized: Boolean = false
    +    var k: Int = _
    +    var count: Long = _
    +    var triK: Int = _
    +    private var wSum: Double = _
    +    private var wwSum: Double = _
    +    private var bSum: Double = _
    +    private var bbSum: Double = _
    +    private var aSum: DenseVector = _
    +    private var abSum: DenseVector = _
    +    private var aaSum: DenseVector = _
    +
    +    private def init(k: Int): Unit = {
    +      require(k <= 4096, "In order to take the normal equation approach efficiently, " +
    +        s"we set the max number of features to 4096 but got $k.")
    +      this.k = k
    +      triK = k * (k + 1) / 2
    +      count = 0L
    +      wSum = 0.0
    +      wwSum = 0.0
    +      bSum = 0.0
    +      bbSum = 0.0
    +      aSum = new DenseVector(Array.ofDim(k))
    +      abSum = new DenseVector(Array.ofDim(k))
    +      aaSum = new DenseVector(Array.ofDim(triK))
    +      initialized = true
    +    }
    +
    +    /**
    +     * Adds an instance.
    +     */
    +    def add(instance: Instance): this.type = {
    +      val Instance(w, a, b) = instance
    +      val ak = a.size
    +      if (!initialized) {
    +        init(ak)
    +        initialized = true
    +      }
    +      assert(ak == k, s"Dimension mismatch. Expect vectors of size $k but got $ak.")
    +      count += 1L
    +      wSum += w
    +      wwSum += w * w
    +      bSum += w * b
    +      bbSum += w * b * b
    +      BLAS.axpy(w, a, aSum)
    +      BLAS.axpy(w * b, a, abSum)
    +      RowMatrix.dspr(w, a, aaSum.values)
    +      this
    +    }
    +
    +    /**
    +     * Merges another [[Aggregator]].
    +     */
    +    def merge(other: Aggregator): this.type = {
    +      if (!other.initialized) {
    +        this
    +      } else {
    +        if (!initialized) {
    --- End diff --
    
    If `this` is not initialized but `other` is, can we just return `other`?
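
One detail relevant to this question: a method declared to return `this.type` cannot return `other`, because `other` does not have the singleton type of `this`. The sketch below uses a hypothetical, heavily simplified `SumAggregator` (not the patch's `Aggregator`) whose `merge` returns the class type instead, which is what makes the shortcut compile. Whether handing back `other` is appropriate for a given aggregation API is an assumption to check separately.

```scala
// Hypothetical, simplified aggregator: tracks only a count and a weight sum.
class SumAggregator extends Serializable {
  var initialized: Boolean = false
  var count: Long = 0L
  var wSum: Double = 0.0

  /** Adds one weight; mutates and returns this instance. */
  def add(w: Double): this.type = {
    initialized = true
    count += 1L
    wSum += w
    this
  }

  /**
   * Merges another aggregator.
   * The return type is SumAggregator rather than this.type, so that
   * `other` can be returned directly when this instance is still empty.
   */
  def merge(other: SumAggregator): SumAggregator = {
    if (!other.initialized) {
      this
    } else if (!initialized) {
      other
    } else {
      count += other.count
      wSum += other.wSum
      this
    }
  }
}
```

With this shape, merging an empty aggregator with a populated one returns the populated instance without copying any state, and merging two populated ones accumulates into the receiver.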




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137512428
  
      [Test build #41977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41977/consoleFull) for   PR 8588 at commit [`34107aa`](https://github.com/apache/spark/commit/34107aa9654d9a7418fc5db9a67de556efde42c1).




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138652809
  
     Merged build triggered.




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137525357
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41977/




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137537973
  
      [Test build #41979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41979/console) for   PR 8588 at commit [`c75ff92`](https://github.com/apache/spark/commit/c75ff923428052c68e43004f5f2488cf0b3d72dc).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Instance(w: Double, a: Vector, b: Double) `





[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138652844
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138683180
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137510971
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137655130
  
      [Test build #41994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41994/console) for   PR 8588 at commit [`1614f22`](https://github.com/apache/spark/commit/1614f2280603df92af049e6deb016cb1de768e80).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Instance(w: Double, a: Vector, b: Double) `





[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959084
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w,,i,,
    + *   + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^,
    + *
    + * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by
    + * [[standardizeLabel]] and [[standardizeFeatures]], respectively.
    + *
    + * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to
    + * match R's `lm`.
    + * Turn on [[standardizeLabel]] to match R's `glmnet`.
    + *
    + * @param fitIntercept whether to fit intercept. If false, z is 0.0.
    + * @param regParam L2 regularization parameter (lambda)
    + * @param standardizeFeatures whether to standardize features. If true, sigma,,j,, is the
    + *                            population standard deviation of the j-th column of A. Otherwise,
    + *                            sigma,,j,, is 1.0.
    + * @param standardizeLabel whether to standardize label. If true, delta is the population standard
    + *                         deviation of the label column b. Otherwise, delta is 1.0.
    + */
    +private[ml] class WeightedLeastSquares(
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val standardizeFeatures: Boolean,
    +    val standardizeLabel: Boolean) extends Logging with Serializable {
    +  import WeightedLeastSquares._
    +
    +  require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    +  if (regParam == 0.0) {
    +    logWarning("regParam is zero, which might cause numerical instability and overfit.")
    --- End diff --
    
    done
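
For reference, the objective in the Scaladoc quoted above leads to a small linear system, which is presumably what "via normal equation" refers to. The derivation below is a sketch: the weighted means and the diagonal matrix D are shorthand introduced here, not notation from the patch.

```latex
\begin{align*}
W &= \sum_i w_i, \quad
\bar a = \frac{1}{W}\sum_i w_i a_i, \quad
\bar b = \frac{1}{W}\sum_i w_i b_i, \\
\overline{aa} &= \frac{1}{W}\sum_i w_i a_i a_i^T, \quad
\overline{ab} = \frac{1}{W}\sum_i w_i b_i a_i, \quad
D = \operatorname{diag}(\sigma_1^2, \dots, \sigma_k^2). \\[4pt]
f(x, z) &= \frac{1}{2W}\sum_i w_i \left(a_i^T x + z - b_i\right)^2
  + \frac{\lambda}{2\delta}\sum_j (\sigma_j x_j)^2 \\[4pt]
\frac{\partial f}{\partial z} = 0 &\;\Rightarrow\; z = \bar b - \bar a^T x \\
\nabla_x f = 0 &\;\Rightarrow\;
  \overline{aa}\,x + z\,\bar a - \overline{ab} + \frac{\lambda}{\delta} D x = 0 \\
&\;\Rightarrow\;
  \left(\overline{aa} - \bar a \bar a^T + \frac{\lambda}{\delta} D\right) x
  = \overline{ab} - \bar b\,\bar a
\end{align*}
```

When no intercept is fitted, z is fixed at 0 and the centering terms drop out. When lambda is 0, the matrix on the left can be singular (for example, with collinear features), which is consistent with the warning logged for `regParam == 0.0`.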




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137650754
  
      [Test build #41994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41994/consoleFull) for   PR 8588 at commit [`1614f22`](https://github.com/apache/spark/commit/1614f2280603df92af049e6deb016cb1de768e80).




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137538139
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137650250
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137655182
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41994/
    Test PASSed.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138673319
  
    jenkins retest this please




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38772511
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala ---
    @@ -0,0 +1,133 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.optim.WeightedLeastSquares.Instance
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.rdd.RDD
    +
    +class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  private var instances: RDD[Instance] = _
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    /*
    +       R code:
    +
    +A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    +b <- c(17, 19, 23, 29)
    +w <- c(1, 2, 3, 4)
    +     */
    +    instances = sc.parallelize(Seq(
    +      Instance(1.0, Vectors.dense(0.0, 5.0).toSparse, 17.0),
    +      Instance(2.0, Vectors.dense(1.0, 7.0), 19.0),
    +      Instance(3.0, Vectors.dense(2.0, 11.0), 23.0),
    +      Instance(4.0, Vectors.dense(3.0, 13.0), 29.0)
    +    ), 2)
    +  }
    +
    +  test("WLS against lm") {
    +    /*
    +       R code:
    +
    +df <- as.data.frame(cbind(A, b))
    +for (formula in c(b ~ . -1, b ~ .)) {
    +  model <- lm(formula, data=df, weights=w)
    +  print(as.vector(coef(model)))
    +}
    +
    +[1] -3.727121  3.009983
    +[1] 18.08  6.08 -0.60
    +     */
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, -3.727121, 3.009983),
    +      Vectors.dense(18.08, 6.08, -0.60))
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      val wls = new WeightedLeastSquares(
    +        fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = false)
    --- End diff --
    
    Do we need `standardizeLabel` here? I think that without regularization, turning standardization on or off will not change the solution.
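
The observation above can be checked numerically. What follows is a minimal sketch in plain Scala (no Spark dependency), not the PR's implementation. It assumes the no-intercept case, and it assumes that `aVar` and `bStd` are the weighted population variance of each feature and the weighted population standard deviation of the label. `StandardizationSketch` and `fit` are hypothetical names introduced for illustration. With `regParam = 0.0` the diagonal shift is zero whatever the flags are, so all four flag combinations should give the same coefficients.

```scala
object StandardizationSketch {
  // (weight, features, label): the four instances from WeightedLeastSquaresSuite.
  val data: Seq[(Double, Array[Double], Double)] = Seq(
    (1.0, Array(0.0, 5.0), 17.0),
    (2.0, Array(1.0, 7.0), 19.0),
    (3.0, Array(2.0, 11.0), 23.0),
    (4.0, Array(3.0, 13.0), 29.0))

  /** Fits two coefficients without an intercept by solving the 2x2 normal equation. */
  def fit(
      regParam: Double,
      standardizeFeatures: Boolean,
      standardizeLabel: Boolean): (Double, Double) = {
    val wSum = data.map(_._1).sum

    // Weighted mean of f(a, b) over all instances.
    def mean(f: (Array[Double], Double) => Double): Double =
      data.map { case (w, a, b) => w * f(a, b) }.sum / wSum

    // Assumed definitions: weighted population statistics.
    val bBar = mean((_, b) => b)
    val bStd = math.sqrt(mean((_, b) => b * b) - bBar * bBar)
    val aVar = Array.tabulate(2) { j =>
      val aBar = mean((a, _) => a(j))
      mean((a, _) => a(j) * a(j)) - aBar * aBar
    }

    // Diagonal shift, mirroring the "add regularization to diagonals" loop.
    def shift(j: Int): Double = {
      var lambda = regParam
      if (standardizeFeatures) lambda *= aVar(j)
      if (standardizeLabel) lambda /= bStd
      lambda
    }

    val m00 = mean((a, _) => a(0) * a(0)) + shift(0)
    val m01 = mean((a, _) => a(0) * a(1))
    val m11 = mean((a, _) => a(1) * a(1)) + shift(1)
    val r0 = mean((a, b) => a(0) * b)
    val r1 = mean((a, b) => a(1) * b)

    // Cramer's rule for the symmetric 2x2 system.
    val det = m00 * m11 - m01 * m01
    ((r0 * m11 - m01 * r1) / det, (m00 * r1 - m01 * r0) / det)
  }
}
```

Calling `StandardizationSketch.fit(0.0, sf, sl)` for every combination of the two flags should reproduce the no-intercept `lm` coefficients quoted in the test (about -3.727121 and 3.009983). The flags only start to matter once `regParam` is positive.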




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959027
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w,,i,,
    + *   + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^,
    + *
    + * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by
    + * [[standardizeLabel]] and [[standardizeFeatures]], respectively.
    + *
    + * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to
    + * match R's `lm`.
    + * Turn on [[standardizeLabel]] to match R's `glmnet`.
    + *
    + * @param fitIntercept whether to fit intercept. If false, z is 0.0.
    + * @param regParam L2 regularization parameter (lambda)
    + * @param standardizeFeatures whether to standardize features. If true, sigma,,j,, is the
    + *                            population standard deviation of the j-th column of A. Otherwise,
    + *                            sigma,,j,, is 1.0.
    + * @param standardizeLabel whether to standardize label. If true, delta is the population standard
    + *                         deviation of the label column b. Otherwise, delta is 1.0.
    + */
    +private[ml] class WeightedLeastSquares(
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val standardizeFeatures: Boolean,
    +    val standardizeLabel: Boolean) extends Logging with Serializable {
    +  import WeightedLeastSquares._
    +
    +  require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    +  if (regParam == 0.0) {
    +    logWarning("regParam is zero, which might cause numerical instability and overfitting.")
    +  }
    +
    +  /**
    +   * Creates a [[WeightedLeastSquaresModel]] from an RDD of [[Instance]]s.
    +   */
    +  def fit(instances: RDD[Instance]): WeightedLeastSquaresModel = {
    +    val summary = instances.treeAggregate(new Aggregator)(_.add(_), _.merge(_))
    +    summary.validate()
    +    logInfo(s"Number of instances: ${summary.count}.")
    +    val triK = summary.triK
    +    val bBar = summary.bBar
    +    val bStd = summary.bStd
    +    val aBar = summary.aBar
    +    val aVar = summary.aVar
    +    val abBar = summary.abBar
    +    val aaBar = summary.aaBar
    +    val aaValues = aaBar.values
    +
    +    if (fitIntercept) {
    +      // shift centers
    +      // A^T A - aBar aBar^T
    +      RowMatrix.dspr(-1.0, aBar, aaValues)
    +      // A^T b - bBar aBar
    +      BLAS.axpy(-bBar, aBar, abBar)
    +    }
    +
    +    // add regularization to diagonals
    +    var i = 0
    +    var j = 2
    +    while (i < triK) {
    +      var lambda = regParam
    +      if (standardizeFeatures) {
    +        lambda *= aVar(j - 2)
    +      }
    +      if (standardizeLabel) {
    +        // TODO: handle the case when bStd = 0
    +        lambda /= bStd
    +      }
    +      aaValues(i) += lambda
    +      i += j
    +      j += 1
    +    }
    +
    +    val x = choleskySolve(aaBar.values, abBar)
    +
    +    // compute intercept
    +    val intercept = if (fitIntercept) {
    +      bBar - BLAS.dot(aBar, x)
    +    } else {
    +      0.0
    +    }
    +
    +    new WeightedLeastSquaresModel(x, intercept)
    +  }
    +
    +  /**
    +   * Solves a symmetric positive definite linear system via Cholesky factorization.
    +   * The input arguments are modified in-place to store the factorization and the solution.
    +   * @param A the upper triangular part of A
    +   * @param bx right-hand side
    +   * @return the solution vector
    +   */
    +  private def choleskySolve(A: Array[Double], bx: DenseVector): DenseVector = {
    +    val k = bx.size
    +    val info = new intW(0)
    +    lapack.dppsv("U", k, 1, A, bx.values, k, info)
    +    val code = info.`val`
    +    assert(code == 0, s"lapack.dppsv returned $code.")
    +    bx
    +  }
    +}
    +
    +private[ml] object WeightedLeastSquares {
    +
    +  /**
    +   * Case class for weighted observations.
    +   * @param w weight, must be nonnegative
    +   * @param a features
    +   * @param b label
    +   */
    +  case class Instance(w: Double, a: Vector, b: Double) {
    +    require(w >= 0.0, s"Weight cannot be negative: $w.")
    +  }
    +
    +  /**
    +   * Aggregator to provide necessary summary statistics for solving [[WeightedLeastSquares]].
    +   */
    +  // TODO: consolidate aggregates for summary statistics
    +  private class Aggregator extends Serializable {
    +    var initialized: Boolean = false
    +    var k: Int = _
    +    var count: Long = _
    +    var triK: Int = _
    +    private var wSum: Double = _
    +    private var wwSum: Double = _
    +    private var bSum: Double = _
    +    private var bbSum: Double = _
    +    private var aSum: DenseVector = _
    +    private var abSum: DenseVector = _
    +    private var aaSum: DenseVector = _
    +
    +    private def init(k: Int): Unit = {
    +      require(k <= 4096, "In order to take the normal equation approach efficiently, " +
    +        s"we set the max number of features to 4096 but got $k.")
    +      this.k = k
    +      triK = k * (k + 1) / 2
    +      count = 0L
    +      wSum = 0.0
    +      wwSum = 0.0
    +      bSum = 0.0
    +      bbSum = 0.0
    +      aSum = new DenseVector(Array.ofDim(k))
    +      abSum = new DenseVector(Array.ofDim(k))
    +      aaSum = new DenseVector(Array.ofDim(triK))
    +      initialized = true
    +    }
    +
    +    /**
    +     * Adds an instance.
    +     */
    +    def add(instance: Instance): this.type = {
    +      val Instance(w, a, b) = instance
    +      val ak = a.size
    +      if (!initialized) {
    +        init(ak)
    +        initialized = true
    +      }
    +      assert(ak == k, s"Dimension mismatch. Expect vectors of size $k but got $ak.")
    +      count += 1L
    +      wSum += w
    +      wwSum += w * w
    +      bSum += w * b
    +      bbSum += w * b * b
    +      BLAS.axpy(w, a, aSum)
    +      BLAS.axpy(w * b, a, abSum)
    +      RowMatrix.dspr(w, a, aaSum.values)
    +      this
    +    }
    +
    +    /**
    +     * Merges another [[Aggregator]].
    +     */
    +    def merge(other: Aggregator): this.type = {
    +      if (!other.initialized) {
    +        this
    +      } else {
    +        if (!initialized) {
    +          init(other.k)
    +        }
    +        assert(k == other.k)
    --- End diff --
    
    done




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138771604
  
    Merged into master. I will make follow-up PRs to do the refactoring.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138673377
  
    Merged build triggered.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959031
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w,,i,,
    --- End diff --
    
    SPARK-10490




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38774759
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w,,i,,
    --- End diff --
    
    This contains the cost function solved by `ALS.LeastSquaresNESolver` (and duplicates the Cholesky `dppsv` solver); should we make a JIRA to refactor existing code to reuse this class?
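
To make the refactoring question concrete, here is a rough sketch of what a shared packed-storage solver does. It is written in plain Scala so that it runs without netlib, whereas the code in this PR calls LAPACK's `dppsv`. `PackedCholeskySketch`, `idx`, and `solve` are hypothetical names, not existing Spark APIs. The layout assumed is the column-major packed upper triangle, in which the diagonal entries sit at indices 0, 2, 5, 9, and so on.

```scala
object PackedCholeskySketch {
  /** Index of entry (i, j), i <= j, in column-major packed upper-triangular storage. */
  def idx(i: Int, j: Int): Int = i + j * (j + 1) / 2

  /**
   * Solves A x = b for a symmetric positive definite A given as a packed upper triangle.
   * Unlike dppsv, this sketch copies its inputs instead of overwriting them.
   */
  def solve(packedUpper: Array[Double], b: Array[Double]): Array[Double] = {
    val k = b.length
    require(packedUpper.length == k * (k + 1) / 2,
      s"Expected ${k * (k + 1) / 2} packed entries but got ${packedUpper.length}.")
    val u = packedUpper.clone()

    // Factor A = U^T U, column by column.
    for (j <- 0 until k; i <- 0 to j) {
      var s = u(idx(i, j))
      for (p <- 0 until i) s -= u(idx(p, i)) * u(idx(p, j))
      if (i < j) {
        u(idx(i, j)) = s / u(idx(i, i))
      } else {
        require(s > 0.0, s"Matrix is not positive definite at pivot $j.")
        u(idx(j, j)) = math.sqrt(s)
      }
    }

    val x = b.clone()
    // Forward substitution: U^T y = b.
    for (i <- 0 until k) {
      for (p <- 0 until i) x(i) -= u(idx(p, i)) * x(p)
      x(i) /= u(idx(i, i))
    }
    // Back substitution: U x = y.
    for (i <- k - 1 to 0 by -1) {
      for (p <- i + 1 until k) x(i) -= u(idx(i, p)) * x(p)
      x(i) /= u(idx(i, i))
    }
    x
  }
}
```

For the unweighted-mean-free form of the test data without an intercept, `PackedCholeskySketch.solve(Array(50.0, 236.0, 1162.0), Array(524.0, 2618.0))` should return roughly (-3.727121, 3.009983), matching the `lm` output quoted in the test suite. A production version would keep delegating to `dppsv`; the sketch only illustrates the contract a shared helper would need to expose.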




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138683028
  
      [Test build #42145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42145/console) for   PR 8588 at commit [`c2ec746`](https://github.com/apache/spark/commit/c2ec746ab6e9aee84dc984c912ab4f0ee2b4e75e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Instance(w: Double, a: Vector, b: Double) `





[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137528926
  
      [Test build #41979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41979/consoleFull) for   PR 8588 at commit [`c75ff92`](https://github.com/apache/spark/commit/c75ff923428052c68e43004f5f2488cf0b3d72dc).




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138657114
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42142/
    Test FAILed.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138674766
  
      [Test build #42145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42145/consoleFull) for   PR 8588 at commit [`c2ec746`](https://github.com/apache/spark/commit/c2ec746ab6e9aee84dc984c912ab4f0ee2b4e75e).




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959092
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w,,i,,
    + *   + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^,
    + *
    + * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by
    + * [[standardizeLabel]] and [[standardizeFeatures]], respectively.
    + *
    + * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to
    + * match R's `lm`.
    + * Turn on [[standardizeLabel]] to match R's `glmnet`.
    + *
    + * @param fitIntercept whether to fit intercept. If false, z is 0.0.
    + * @param regParam L2 regularization parameter (lambda)
    + * @param standardizeFeatures whether to standardize features. If true, sigma,,j,, is the
    + *                            population standard deviation of the j-th column of A. Otherwise,
    + *                            sigma,,j,, is 1.0.
    + * @param standardizeLabel whether to standardize label. If true, delta is the population standard
    + *                         deviation of the label column b. Otherwise, delta is 1.0.
    + */
    +private[ml] class WeightedLeastSquares(
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val standardizeFeatures: Boolean,
    +    val standardizeLabel: Boolean) extends Logging with Serializable {
    +  import WeightedLeastSquares._
    +
    +  require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    +  if (regParam == 0.0) {
    +    logWarning("regParam is zero, which might cause numerical instability and overfitting.")
    +  }
    +
    +  /**
    +   * Creates a [[WeightedLeastSquaresModel]] from an RDD of [[Instance]]s.
    +   */
    +  def fit(instances: RDD[Instance]): WeightedLeastSquaresModel = {
    +    val summary = instances.treeAggregate(new Aggregator)(_.add(_), _.merge(_))
    +    summary.validate()
    +    logInfo(s"Number of instances: ${summary.count}.")
    +    val triK = summary.triK
    +    val bBar = summary.bBar
    +    val bStd = summary.bStd
    +    val aBar = summary.aBar
    +    val aVar = summary.aVar
    +    val abBar = summary.abBar
    +    val aaBar = summary.aaBar
    +    val aaValues = aaBar.values
    +
    +    if (fitIntercept) {
    +      // shift centers
    +      // A^T A - aBar aBar^T
    +      RowMatrix.dspr(-1.0, aBar, aaValues)
    +      // A^T b - bBar aBar
    +      BLAS.axpy(-bBar, aBar, abBar)
    +    }
    +
    +    // add regularization to diagonals
    +    var i = 0
    +    var j = 2
    +    while (i < triK) {
    +      var lambda = regParam
    +      if (standardizeFeatures) {
    +        lambda *= aVar(j - 2)
    +      }
    +      if (standardizeLabel) {
    +        // TODO: handle the case when bStd = 0
    +        lambda /= bStd
    +      }
    +      aaValues(i) += lambda
    +      i += j
    +      j += 1
    +    }
    +
    +    val x = choleskySolve(aaBar.values, abBar)
    +
    +    // compute intercept
    +    val intercept = if (fitIntercept) {
    +      bBar - BLAS.dot(aBar, x)
    +    } else {
    +      0.0
    +    }
    +
    +    new WeightedLeastSquaresModel(x, intercept)
    +  }
    +
    +  /**
    +   * Solves a symmetric positive definite linear system via Cholesky factorization.
    +   * The input arguments are modified in-place to store the factorization and the solution.
    +   * @param A the upper triangular part of A
    +   * @param bx right-hand side
    +   * @return the solution vector
    +   */
    +  private def choleskySolve(A: Array[Double], bx: DenseVector): DenseVector = {
    +    val k = bx.size
    +    val info = new intW(0)
    +    lapack.dppsv("U", k, 1, A, bx.values, k, info)
    +    val code = info.`val`
    +    assert(code == 0, s"lapack.dpotrs returned $code.")
    +    bx
    +  }
    +}
    +
    +private[ml] object WeightedLeastSquares {
    +
    +  /**
    +   * Case class for weighted observations.
    +   * @param w weight, must be positive
    +   * @param a features
    +   * @param b label
    +   */
    +  case class Instance(w: Double, a: Vector, b: Double) {
    +    require(w >= 0.0, s"Weight cannot be negative: $w.")
    +  }
    +
    +  /**
    +   * Aggregator to provide necessary summary statistics for solving [[WeightedLeastSquares]].
    +   */
    +  // TODO: consolidate aggregates for summary statistics
    +  private class Aggregator extends Serializable {
    +    var initialized: Boolean = false
    +    var k: Int = _
    +    var count: Long = _
    +    var triK: Int = _
    +    private var wSum: Double = _
    +    private var wwSum: Double = _
    +    private var bSum: Double = _
    +    private var bbSum: Double = _
    +    private var aSum: DenseVector = _
    +    private var abSum: DenseVector = _
    +    private var aaSum: DenseVector = _
    +
    +    private def init(k: Int): Unit = {
    +      require(k <= 4096, "In order to take the normal equation approach efficiently, " +
    +        s"we set the max number of features to 4096 but got $k.")
    +      this.k = k
    +      triK = k * (k + 1) / 2
    +      count = 0L
    +      wSum = 0.0
    +      wwSum = 0.0
    +      bSum = 0.0
    +      bbSum = 0.0
    +      aSum = new DenseVector(Array.ofDim(k))
    +      abSum = new DenseVector(Array.ofDim(k))
    +      aaSum = new DenseVector(Array.ofDim(triK))
    +      initialized = true
    +    }
    +
    +    /**
    +     * Adds an instance.
    +     */
    +    def add(instance: Instance): this.type = {
    +      val Instance(w, a, b) = instance
    +      val ak = a.size
    +      if (!initialized) {
    +        init(ak)
    +        initialized = true
    +      }
    +      assert(ak == k, s"Dimension mismatch. Expect vectors of size $k but got $ak.")
    +      count += 1L
    +      wSum += w
    +      wwSum += w * w
    +      bSum += w * b
    +      bbSum += w * b * b
    +      BLAS.axpy(w, a, aSum)
    +      BLAS.axpy(w * b, a, abSum)
    +      RowMatrix.dspr(w, a, aaSum.values)
    +      this
    +    }
    +
    +    /**
    +     * Merges another [[Aggregator]].
    +     */
    +    def merge(other: Aggregator): this.type = {
    +      if (!other.initialized) {
    +        this
    +      } else {
    +        if (!initialized) {
    --- End diff --
    
    The contract of `merge` in Spark is that the first argument is mutable but not the second. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1068.
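
The contract described in this comment can be illustrated with a minimal sketch that does not depend on Spark. `MergeContractSketch`, `SumAgg`, and `aggregate` are hypothetical names invented for this illustration and are not part of the pull request; the sketch only assumes, as the comment states, that the combine step may mutate its first argument and must leave the second untouched.

```scala
object MergeContractSketch {
  /** Hypothetical aggregator that tracks a count and a weight sum. */
  final class SumAgg extends Serializable {
    var count: Long = 0L
    var wSum: Double = 0.0

    /** Adds one weight; mutates and returns `this`. */
    def add(w: Double): this.type = {
      count += 1L
      wSum += w
      this
    }

    /** Folds `other` into `this`; `other` is only read, never written. */
    def merge(other: SumAgg): this.type = {
      count += other.count
      wSum += other.wSum
      this
    }
  }

  /** Mimics a two-level aggregation over in-memory "partitions". */
  def aggregate(partitions: Seq[Seq[Double]]): SumAgg = {
    // One fresh aggregator per partition (the per-partition step).
    val partials = partitions.map(_.foldLeft(new SumAgg)((agg, w) => agg.add(w)))
    // Combine step: only the accumulator on the left is mutated.
    partials.foldLeft(new SumAgg)((acc, p) => acc.merge(p))
  }
}
```

Because `merge` never writes to `other`, a partial result can be reused or merged again without surprises.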




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959002
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala ---
    @@ -0,0 +1,133 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.optim.WeightedLeastSquares.Instance
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.rdd.RDD
    +
    +class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  private var instances: RDD[Instance] = _
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    /*
    +       R code:
    +
    +A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    +b <- c(17, 19, 23, 29)
    +w <- c(1, 2, 3, 4)
    +     */
    +    instances = sc.parallelize(Seq(
    +      Instance(1.0, Vectors.dense(0.0, 5.0).toSparse, 17.0),
    +      Instance(2.0, Vectors.dense(1.0, 7.0), 19.0),
    +      Instance(3.0, Vectors.dense(2.0, 11.0), 23.0),
    +      Instance(4.0, Vectors.dense(3.0, 13.0), 29.0)
    +    ), 2)
    +  }
    +
    +  test("WLS against lm") {
    +    /*
    +       R code:
    +
    +df <- as.data.frame(cbind(A, b))
    +for (formula in c(b ~ . -1, b ~ .)) {
    +  model <- lm(formula, data=df, weights=w)
    +  print(as.vector(coef(model)))
    +}
    +
    +[1] -3.727121  3.009983
    +[1] 18.08  6.08 -0.60
    +     */
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, -3.727121, 3.009983),
    +      Vectors.dense(18.08, 6.08, -0.60))
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      val wls = new WeightedLeastSquares(
    +        fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = false)
    --- End diff --
    
    We don't need it, but I think it is useful to list the values explicitly here.
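
As a cross-check of the `lm` coefficients quoted in the test (18.08, 6.08, -0.60 for the fit with intercept), the weighted normal equations for the four test instances can be solved directly. This is a minimal Spark-free sketch; `NormalEquationSketch`, `solve`, and `fitWithIntercept` are hypothetical names introduced here, and plain Gaussian elimination stands in for the Cholesky solve used in the pull request.

```scala
object NormalEquationSketch {
  /** Solves m * x = rhs by Gaussian elimination with partial pivoting. */
  def solve(m: Array[Array[Double]], rhs: Array[Double]): Array[Double] = {
    val n = rhs.length
    val a = m.map(_.clone())
    val b = rhs.clone()
    for (col <- 0 until n) {
      // Swap in the row with the largest pivot for numerical stability.
      val pivot = (col until n).maxBy(r => math.abs(a(r)(col)))
      val tmpRow = a(col); a(col) = a(pivot); a(pivot) = tmpRow
      val tmpB = b(col); b(col) = b(pivot); b(pivot) = tmpB
      for (r <- col + 1 until n) {
        val f = a(r)(col) / a(col)(col)
        for (c <- col until n) a(r)(c) -= f * a(col)(c)
        b(r) -= f * b(col)
      }
    }
    // Back substitution.
    val x = new Array[Double](n)
    for (r <- n - 1 to 0 by -1) {
      var s = b(r)
      for (c <- r + 1 until n) s -= a(r)(c) * x(c)
      x(r) = s / a(r)(r)
    }
    x
  }

  /** Unregularized weighted least squares; returns (intercept, coefficients...). */
  def fitWithIntercept(
      w: Array[Double],
      a: Array[Array[Double]],
      b: Array[Double]): Array[Double] = {
    val rows = a.map(row => 1.0 +: row) // prepend a constant column for the intercept
    val k = rows.head.length
    val gram = Array.ofDim[Double](k, k) // sum_i w_i * row_i * row_i^T
    val rhs = new Array[Double](k)       // sum_i w_i * b_i * row_i
    for (i <- rows.indices; p <- 0 until k) {
      rhs(p) += w(i) * rows(i)(p) * b(i)
      for (q <- 0 until k) gram(p)(q) += w(i) * rows(i)(p) * rows(i)(q)
    }
    solve(gram, rhs)
  }
}
```

Calling `fitWithIntercept` with the weights, features, and labels from `beforeAll` should reproduce the second row of `expected`.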




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137527221
  
    Merged build triggered.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137655181
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137525040
  
    [Test build #41977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41977/console) for PR 8588 at commit [`34107aa`](https://github.com/apache/spark/commit/34107aa9654d9a7418fc5db9a67de556efde42c1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `case class Instance(w: Double, a: Vector, b: Double)`





[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-138683182
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42145/
    Test PASSed.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38959020
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala ---
    @@ -0,0 +1,133 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.optim.WeightedLeastSquares.Instance
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.rdd.RDD
    +
    +class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  private var instances: RDD[Instance] = _
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    /*
    +       R code:
    +
    +A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    --- End diff --
    
    Leaving the R code unindented actually makes it easier to copy around, but I will update the indentation.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38774976
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i
    + *   + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^,
    + *
    + * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by
    + * [[standardizeLabel]] and [[standardizeFeatures]], respectively.
    + *
    + * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to
    + * match R's `lm`.
    + * Turn on [[standardizeLabel]] to match R's `glmnet`.
    + *
    + * @param fitIntercept whether to fit intercept. If false, z is 0.0.
    + * @param regParam L2 regularization parameter (lambda)
    + * @param standardizeFeatures whether to standardize features. If true, sigma_,,j,, is the
    + *                            population standard deviation of the j-th column of A. Otherwise,
    + *                            sigma,,j,, is 1.0.
    + * @param standardizeLabel whether to standardize label. If true, delta is the population standard
    + *                         deviation of the label column b. Otherwise, delta is 1.0.
    + */
    +private[ml] class WeightedLeastSquares(
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val standardizeFeatures: Boolean,
    +    val standardizeLabel: Boolean) extends Logging with Serializable {
    +  import WeightedLeastSquares._
    +
    +  require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    +  if (regParam == 0.0) {
    +    logWarning("regParam is zero, which might cause numerical instability and overfit.")
    --- End diff --
    
    nit: "overfitting"




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137511010
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38773848
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala ---
    @@ -0,0 +1,133 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.optim.WeightedLeastSquares.Instance
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.rdd.RDD
    +
    +class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  private var instances: RDD[Instance] = _
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    /*
    +       R code:
    +
    +A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    --- End diff --
    
    The existing R code in other test suites is usually aligned with the opening `/*`.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38772704
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    --- End diff --
    
    Will you merge this code into the current `LinearRegression.scala`?




[GitHub] spark pull request: [WIP][SPARK-9834][MLLIB] implement weighted le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8588#issuecomment-137527241
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8588#discussion_r38773970
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -0,0 +1,295 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
    +import org.netlib.util.intW
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[WeightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class WeightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Weighted least squares solver via normal equation.
    + * Given weighted observations (w,,i,,, a,,i,,, b,,i,,), we use the following weighted least squares
    + * formulation:
    + *
    + * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i
    + *   + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^,
    + *
    + * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by
    + * [[standardizeLabel]] and [[standardizeFeatures]], respectively.
    + *
    + * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to
    + * match R's `lm`.
    + * Turn on [[standardizeLabel]] to match R's `glmnet`.
    + *
    + * @param fitIntercept whether to fit intercept. If false, z is 0.0.
    + * @param regParam L2 regularization parameter (lambda)
    + * @param standardizeFeatures whether to standardize features. If true, sigma_,,j,, is the
    + *                            population standard deviation of the j-th column of A. Otherwise,
    + *                            sigma,,j,, is 1.0.
    + * @param standardizeLabel whether to standardize label. If true, delta is the population standard
    + *                         deviation of the label column b. Otherwise, delta is 1.0.
    + */
    +private[ml] class WeightedLeastSquares(
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val standardizeFeatures: Boolean,
    +    val standardizeLabel: Boolean) extends Logging with Serializable {
    +  import WeightedLeastSquares._
    +
    +  require(regParam >= 0.0, s"regParam cannot be negative: $regParam")
    +  if (regParam == 0.0) {
    +    logWarning("regParam is zero, which might cause numerical instability and overfit.")
    +  }
    +
    +  /**
    +   * Creates a [[WeightedLeastSquaresModel]] from an RDD of [[Instance]]s.
    +   */
    +  def fit(instances: RDD[Instance]): WeightedLeastSquaresModel = {
    +    val summary = instances.treeAggregate(new Aggregator)(_.add(_), _.merge(_))
    +    summary.validate()
    +    logInfo(s"Number of instances: ${summary.count}.")
    +    val triK = summary.triK
    +    val bBar = summary.bBar
    +    val bStd = summary.bStd
    +    val aBar = summary.aBar
    +    val aVar = summary.aVar
    +    val abBar = summary.abBar
    +    val aaBar = summary.aaBar
    +    val aaValues = aaBar.values
    +
    +    if (fitIntercept) {
    +      // shift centers
    +      // A^T A - aBar aBar^T
    +      RowMatrix.dspr(-1.0, aBar, aaValues)
    +      // A^T b - bBar aBar
    +      BLAS.axpy(-bBar, aBar, abBar)
    +    }
    +
    +    // add regularization to diagonals
    +    var i = 0
    +    var j = 2
    +    while (i < triK) {
    +      var lambda = regParam
    +      if (standardizeFeatures) {
    +        lambda *= aVar(j - 2)
    +      }
    +      if (standardizeLabel) {
    +        // TODO: handle the case when bStd = 0
    +        lambda /= bStd
    +      }
    +      aaValues(i) += lambda
    +      i += j
    +      j += 1
    +    }
    +
    +    val x = choleskySolve(aaBar.values, abBar)
    +
    +    // compute intercept
    +    val intercept = if (fitIntercept) {
    +      bBar - BLAS.dot(aBar, x)
    +    } else {
    +      0.0
    +    }
    +
    +    new WeightedLeastSquaresModel(x, intercept)
    +  }
    +
    +  /**
    +   * Solves a symmetric positive definite linear system via Cholesky factorization.
    +   * The input arguments are modified in-place to store the factorization and the solution.
    +   * @param A the upper triangular part of A
    +   * @param bx right-hand side
    +   * @return the solution vector
    +   */
    +  private def choleskySolve(A: Array[Double], bx: DenseVector): DenseVector = {
    +    val k = bx.size
    +    val info = new intW(0)
    +    lapack.dppsv("U", k, 1, A, bx.values, k, info)
    +    val code = info.`val`
    +    assert(code == 0, s"lapack.dpotrs returned $code.")
    +    bx
    +  }
    +}
    +
    +private[ml] object WeightedLeastSquares {
    +
    +  /**
    +   * Case class for weighted observations.
    +   * @param w weight, must be positive
    +   * @param a features
    +   * @param b label
    +   */
    +  case class Instance(w: Double, a: Vector, b: Double) {
    +    require(w >= 0.0, s"Weight cannot be negative: $w.")
    +  }
    +
    +  /**
    +   * Aggregator to provide necessary summary statistics for solving [[WeightedLeastSquares]].
    +   */
    +  // TODO: consolidate aggregates for summary statistics
    +  private class Aggregator extends Serializable {
    +    var initialized: Boolean = false
    +    var k: Int = _
    +    var count: Long = _
    +    var triK: Int = _
    +    private var wSum: Double = _
    +    private var wwSum: Double = _
    +    private var bSum: Double = _
    +    private var bbSum: Double = _
    +    private var aSum: DenseVector = _
    +    private var abSum: DenseVector = _
    +    private var aaSum: DenseVector = _
    +
    +    private def init(k: Int): Unit = {
    +      require(k <= 4096, "In order to take the normal equation approach efficiently, " +
    +        s"we set the max number of features to 4096 but got $k.")
    +      this.k = k
    +      triK = k * (k + 1) / 2
    +      count = 0L
    +      wSum = 0.0
    +      wwSum = 0.0
    +      bSum = 0.0
    +      bbSum = 0.0
    +      aSum = new DenseVector(Array.ofDim(k))
    +      abSum = new DenseVector(Array.ofDim(k))
    +      aaSum = new DenseVector(Array.ofDim(triK))
    +      initialized = true
    +    }
    +
    +    /**
    +     * Adds an instance.
    +     */
    +    def add(instance: Instance): this.type = {
    +      val Instance(w, a, b) = instance
    +      val ak = a.size
    +      if (!initialized) {
    +        // init sets `initialized = true` itself
    +        init(ak)
    +      }
    +      assert(ak == k, s"Dimension mismatch. Expect vectors of size $k but got $ak.")
    +      count += 1L
    +      wSum += w
    +      wwSum += w * w
    +      bSum += w * b
    +      bbSum += w * b * b
    +      BLAS.axpy(w, a, aSum)
    +      BLAS.axpy(w * b, a, abSum)
    +      RowMatrix.dspr(w, a, aaSum.values)
    +      this
    +    }
    +
    +    /**
    +     * Merges another [[Aggregator]].
    +     */
    +    def merge(other: Aggregator): this.type = {
    +      if (!other.initialized) {
    +        this
    +      } else {
    +        if (!initialized) {
    +          init(other.k)
    +        }
    +        assert(k == other.k)
    --- End diff --
    
    nit: this assertion is missing an error message
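
For illustration, a minimal sketch of what the check in `merge` could look like with a message, reusing the wording of the assertion already present in `add`. `DimAggregator` is a hypothetical stand-in reduced to the dimension bookkeeping; it is not the PR's `Aggregator`, and the exact message text is only a suggestion.

```scala
// Hypothetical, stripped-down stand-in for the PR's Aggregator:
// it keeps only the fields needed to show the dimension check.
final class DimAggregator {
  var initialized: Boolean = false
  var k: Int = 0

  def init(k: Int): Unit = {
    this.k = k
    initialized = true
  }

  def merge(other: DimAggregator): this.type = {
    if (other.initialized) {
      if (!initialized) {
        init(other.k)
      }
      // Same wording as the assertion in `add`, so both failures read alike.
      assert(k == other.k,
        s"Dimension mismatch. Expect vectors of size $k but got ${other.k}.")
    }
    this
  }
}
```

With a message attached, a failed merge reports both sizes instead of a bare "assertion failed".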

