Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2014/04/08 03:54:13 UTC

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/353

    SPARK-1157: L-BFGS Optimizer based on Breeze's implementation.

    This PR uses Breeze's L-BFGS implementation; the Breeze dependency was already introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr!
    
    When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularized part of the cost function). With the current updater design, we can compute these two values as follows.
    
    Let's review how the updater computes newWeights given the input parameters.
    
    w' = w - thisIterStepSize * (gradient + regGradient(w))

    Note that regGradient is a function of w! If we set gradient = 0 and thisIterStepSize = 1, then

    w' = w - regGradient(w), i.e. regGradient(w) = w - w'
    
    As a result, regVal can be computed by
    
        // In the current Updater API, compute(weightsOld, gradient, stepSize, iter, regParam)
        // returns (newWeights, regVal). With a zero gradient and stepSize = 0, the weights are
        // unchanged, and ._2 is the regularization value evaluated at `weights`.
        val regVal = updater.compute(
          weights,
          new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
    and regGradient can be obtained by
    
          // With a zero gradient and stepSize = 1, compute(...)._1 returns w' = w - regGradient(w),
          // so subtracting it from the current weights yields regGradient(w).
          val regGradient = weights.sub(
            updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
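
    Putting the two together gives the full objective and its gradient. A
    minimal sketch, assuming the same DoubleMatrix-based Updater API as above,
    where lossSum, gradientSum, and miniBatchSize come from the usual
    map-reduce over the data:

        // Total loss = averaged loss over the (mini) batch + regularization value.
        val loss = lossSum / miniBatchSize + regVal
        // Total gradient = averaged gradient over the (mini) batch + regularization gradient.
        val gradientTotal = gradientSum.div(miniBatchSize).add(regGradient)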
    
    The PR includes tests that compare the results with SGD, both with and without regularization.
    
    We did a comparison between L-BFGS and SGD, and often saw 10x fewer
    steps with L-BFGS, while the cost per step is the same (just computing
    the gradient).
    
    The following paper by Prof. Ng's group at Stanford compares different
    optimizers, including L-BFGS and SGD. They use them in the context of
    deep learning, but it is worth reading as a reference.
    http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
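
    For reference, a minimal usage sketch of the optimizer added in this PR
    (this follows the setter API shown in the diffs later in this thread;
    `data` is an assumed RDD[(Double, Vector)] of (label, features) pairs and
    `initialWeights` an assumed starting Vector):

        import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

        val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
          .setNumCorrections(10)
          .setMaxNumIterations(100)
          .setRegParam(0.1)
        // Returns the optimized weight vector.
        val weights = lbfgs.optimize(data, initialWeights)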

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #353
    
----
commit 60c83350bb77aa640edd290a26e2a20281b7a3a8
Author: DB Tsai <db...@dbtsai.com>
Date:   2014-04-05T00:06:50Z

    L-BFGS Optimizer based on Breeze's implementation.

----



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40413939
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39804341
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460030
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    --- End diff --
    
    Why?



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39810849
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11463764
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    --- End diff --
    
    This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should have at least the same performance.




[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459107
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in the mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set it to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convTolerance)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  class CostFun(
    --- End diff --
    
    mark it private and add doc.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11469009
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,209 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  var convergenceTol = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    // This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should
    +    // at least have the same performance. It's based on observation, not theoretically guaranteed.
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // GD converges much more slowly than L-BFGS. To achieve a 1% difference,
    +    // GD requires 90 iterations. No matter how much we increase the number
    +    // of iterations in GD here, lossGD will always be larger than lossLBFGS.
    +    // This is based on observation, not theoretically guaranteed.
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.02,
    +      "LBFGS should match GD result within 2% difference.")
    +  }
    +
    +  test("LBFGS and Gradient Descent with L2 regularization should get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    val numGDIterations = 50
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    // The 2% difference here is based on observation, but is not theoretically guaranteed.
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.02),
    +      "The last losses of LBFGS and GD should be within 2% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.02) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.02),
    +      "The weight differences between LBFGS and GD should be within 2%.")
    +  }
    +
    +  test("The convergence criteria should work as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set the convergenceTol to 0.0, so that the algorithm will
    +     * run up to the maxNumIterations which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    maxNumIterations = 8
    --- End diff --
    
    I'm not sure whether this is safe if we turn on parallel testing. To be safe, define the `var`s locally.
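
    A minimal sketch of that fix, reusing the FunSuite style already in this
    suite (the local vals replace the shared suite-level vars):

        test("The convergence criteria should work as we expect.") {
          // Local vals: concurrently running tests cannot observe
          // each other's mutations.
          val maxNumIterations = 8
          val convergenceTol = 0.0
          // ... run LBFGS.runMiniBatchLBFGS with these local values ...
        }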



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11468996
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,257 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convergenceTol)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  /**
    +   * CostFun implements Breeze's DiffFunction[T], which returns the loss and gradient
    +   * at a particular point (weights). It's used in Breeze's convex optimization routines.
    +   */
    +  private class CostFun(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    lossHistory: ArrayBuffer[Double],
    +    miniBatchSize: Double) extends DiffFunction[BDV[Double]] {
    +
    +    private var i = 0
    +
    +    override def calculate(weights: BDV[Double]) = {
    +      // Keep local copies to avoid serializing the CostFun object, which is not serializable.
    +      val localData = data
    +      val localGradient = gradient
    +
    +      val (gradientSum, lossSum) = localData.sample(false, miniBatchFraction, 42 + i)
    +        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    +          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    +            val l = localGradient.compute(
    +              features, label, Vectors.fromBreeze(weights), Vectors.fromBreeze(grad))
    +            (grad, loss + l)
    +          },
    +          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    +            (grad1 += grad2, loss1 + loss2)
    +          })
    +
    +      /**
    +       * regVal is the sum of squared weights if it's the L2 updater;
    +       * other updaters follow the same logic.
    +       */
    +      val regVal = updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
    +
    +      val loss = lossSum / miniBatchSize + regVal
    +      /**
    +       * This computes the gradient of the regularization part using the updater.
    +       *
    +       * Given the input parameters, the updater basically does the following,
    +       *
    +       * w' = w - thisIterStepSize * (gradient + regGradient(w))
    +       * Note that regGradient is a function of w.
    +       *
    +       * If we set gradient = 0, thisIterStepSize = 1, then
    +       *
    +       * regGradient(w) = w - w'
    +       *
    +       * TODO: We need to clean it up by separating the logic of regularization out
    +       *       from updater to regularizer.
    +       */
    +      val regGradient = weights - updater.compute(
    --- End diff --
    
    change the variable name to `gradient` or `gradientTotal`
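
    A sketch of the rename, reusing the `axpy` already imported in this diff to
    fold the averaged loss gradient into the regularization gradient (the exact
    shape is up to the author):

        // Gradient of the regularization part, via the updater trick above.
        val gradientTotal = weights - updater.compute(
          Vectors.fromBreeze(weights),
          Vectors.dense(new Array[Double](weights.size)), 1, 1, regParam)._1.toBreeze
        // gradientTotal += (1.0 / miniBatchSize) * gradientSum
        axpy(1.0 / miniBatchSize, gradientSum, gradientTotal)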



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39895140
  
    @mengxr As you suggested, I moved the cost function into a private CostFun class.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40429267
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40414083
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14117/



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11464736
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.05) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.05),
    +      "The weight differences between LBFGS and GD should be within 5% difference.")
    +  }
    +
    +  test("Test if the convergence criteria works as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set the convTolerance to 0.0, so that the algorithm will
    +     * run up to the maxNumIterations which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    maxNumIterations = 8
    +    convTolerance = 0
    +
    +    val (_, lossLBFGS1) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Note that the first loss is computed with the initial weights,
    +    // so the total number of losses will be the number of iterations + 1.
    +    assert(lossLBFGS1.length == 9)
    +
    +    convTolerance = 0.1
    +    val (_, lossLBFGS2) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS2.length == 4)
    +    assert((lossLBFGS2(2) - lossLBFGS2(3)) / lossLBFGS2(2) < convTolerance)
    +
    +    // With smaller convTolerance, it takes more steps.
    +    convTolerance = 0.01
    +    val (_, lossLBFGS3) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS3.length == 6)
    +    assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < convTolerance)
    +  }
    +}
    +
    --- End diff --
    
    Without this extra empty line, Jenkins will complain.



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39810855
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40177786
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40250825
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39804835
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40452444
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11457830
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    --- End diff --
    
    Scala imports `Array` by default.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40515042
  
    Thanks - merged this!



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40413955
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40250897
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11528087
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,203 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +    val convergenceTol = 1e-12
    +    val maxNumIterations = 10
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      simpleUpdater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Since the cost function is convex, the loss is guaranteed to be monotonically decreasing with L-BFGS optimizer.
    +    // (SGD doesn't guarantee this, and the loss will be fluctuating in the optimization process.)
    +    assert((loss, loss.tail).zipped.forall(_ > _), "loss should be monotonically decreasing.")
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      simpleUpdater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // GD converges way slower than L-BFGS. To achieve a 1% difference,
    +    // it requires 90 iterations in GD. No matter how much we increase
    +    // the number of iterations in GD here, lossGD will always be
    +    // larger than lossLBFGS. This is based on observation, not theoretically guaranteed.
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.02,
    +      "LBFGS should match GD result within 2% difference.")
    +  }
    +
    +  test("LBFGS and Gradient Descent with L2 regularization should get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +    val convergenceTol = 1e-12
    +    val maxNumIterations = 10
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    val numGDIterations = 50
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    // The 2% difference here is based on observation, but is not theoretically guaranteed.
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.02),
    +      "The last losses of LBFGS and GD should be within 2% difference.")
    +
    +    assert(compareDouble(weightLBFGS(0), weightGD(0), 0.02) &&
    +      compareDouble(weightLBFGS(1), weightGD(1), 0.02),
    +      "The weight differences between LBFGS and GD should be within 2%.")
    +  }
    +
    +  test("The convergence criteria should work as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set the convergenceTol to 0.0, so that the algorithm will
    +     * run up to the maxNumIterations which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    val maxNumIterations = 8
    +    var convergenceTol = 0.0
    +
    +    val (_, lossLBFGS1) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Note that the first loss is computed with the initial weights,
    +    // so the total number of losses will be the number of iterations + 1.
    +    assert(lossLBFGS1.length == 9)
    +
    +    convergenceTol = 0.1
    +    val (_, lossLBFGS2) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Based on observation, lossLBFGS2 runs 3 iterations; this is not theoretically guaranteed.
    +    assert(lossLBFGS2.length == 4)
    +    assert((lossLBFGS2(2) - lossLBFGS2(3)) / lossLBFGS2(2) < convergenceTol)
    +
    +    convergenceTol = 0.01
    +    val (_, lossLBFGS3) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With smaller convergenceTol, it takes more steps.
    +    assert(lossLBFGS3.length > lossLBFGS2.length)
    +
    +    // Based on observation, lossLBFGS3 runs 5 iterations; this is not theoretically guaranteed.
    +    assert(lossLBFGS3.length == 6)
    +    assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < convergenceTol)
    +  }
    +}
    --- End diff --
    
    Ditto.
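    
    On the convergence checks in the quoted suite, a minimal sketch of the relative-improvement criterion the assertions exercise (assuming Breeze's L-BFGS stops once the relative decrease in loss drops below `convergenceTol`):
    
    ~~~
    // Sketch only; mirrors the (loss(i) - loss(i + 1)) / loss(i) checks above.
    def converged(prevLoss: Double, currLoss: Double, convergenceTol: Double): Boolean =
      (prevLoss - currLoss) / prevLoss < convergenceTol
    ~~~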


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11571564
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    --- End diff --
    
    Does the following fit?
    
    ~~~
    val costFun = 
      new CostFun(data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    ~~~
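    
    As an aside, a hedged usage sketch of the builder-style API quoted in this diff, assuming a `dataRDD: RDD[(Double, Vector)]` and an `initialWeights: Vector` are in scope (parameter values are illustrative only):
    
    ~~~
    val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
      .setNumCorrections(10)
      .setConvergenceTol(1e-4)
      .setMaxNumIterations(50)
      .setRegParam(0.1)
    // optimize returns the final weight vector for the given data and start point.
    val weights = lbfgs.optimize(dataRDD, initialWeights)
    ~~~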


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40252651
  
    Merged build started. 


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11528239
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,203 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +    val convergenceTol = 1e-12
    +    val maxNumIterations = 10
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      simpleUpdater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Since the cost function is convex, the loss is guaranteed to be monotonically decreasing with L-BFGS optimizer.
    --- End diff --
    
    line too long?


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460273
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    --- End diff --
    
    Again, why? I think this is what you observed, but it is not theoretically guaranteed. For example, if we change the random seed, is it possible to break this test? If that happens, then someone will look at the test and ask the same question: "why?". Better to put a comment saying this threshold is set based on observation, which might not hold if the underlying implementation changes.
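    
    A minimal sketch of the suggested comment-plus-assertion (the names and the exact threshold are illustrative):
    
    ~~~
    // Empirical threshold, observed with seed 42; not a theoretical guarantee.
    // It may need adjusting if the optimizer, the sampling, or the seed changes.
    val observedTol = 0.05
    assert(math.abs((lossGD.last - lossLBFGS.last) / lossLBFGS.last) < observedTol,
      "The last losses of LBFGS and GD should be within 5% (empirically).")
    ~~~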


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40177717
  
     Merged build triggered. 


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458125
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    --- End diff --
    
    You don't need to declare the type info for primitive types.
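    
    That is, the inferred types already match the annotations, so the trimmed declarations would read:
    
    ~~~
    private var numCorrections = 10        // inferred as Int
    private var lineSearchTolerance = 0.9  // inferred as Double
    ~~~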


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458037
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    --- End diff --
    
    Move `{` to the line above. Maybe `extends ...` fits in the `class ...` line.


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40174857
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14051/


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459442
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems), it may
    +   * be advantageous to set it to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convTolerance)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  class CostFun(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    lossHistory: ArrayBuffer[Double],
    +    miniBatchSize: Double) extends DiffFunction[BDV[Double]] {
    +
    +    private var i = 0
    +
    +    def calculate(weights: BDV[Double]) = {
    +      // Have a local copy to avoid the serialization of CostFun object which is not serializable.
    +      val localData = data
    +      val localGradient = gradient
    +
    +      val (gradientSum, lossSum) = localData.sample(false, miniBatchFraction, 42 + i)
    +        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    +          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    +            val l = localGradient.compute(
    +              features, label, Vectors.fromBreeze(weights), Vectors.fromBreeze(grad))
    +            (grad, loss + l)
    +          },
    +          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    +            (grad1 += grad2, loss1 + loss2)
    +          })
    +
    +      /**
    +       * regVal is the regularization value, e.g. the (scaled) squared norm of the
    +       * weights for the L2 updater; other updaters follow the same logic.
    +       */
    +      val regVal = updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
    +
    +      val loss = lossSum / miniBatchSize + regVal
    +      /**
    +       * It will return the gradient part of regularization using updater.
    +       *
    +       * Given the input parameters, the updater basically does the following,
    +       *
    +       * w' = w - thisIterStepSize * (gradient + regGradient(w))
    +       * Note that regGradient is function of w
    +       *
    +       * If we set gradient = 0, thisIterStepSize = 1, then
    +       *
    +       * regGradient(w) = w - w'
    --- End diff --
    
    Put a TODO here. We need to clean it up later.
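    
    To make the TODO concrete, a hedged sketch of the trick the quoted comment describes, reusing names from this diff: with `gradient = 0` and `thisIterStepSize = 1` the updater returns `w' = w - regGradient(w)`, so the regularization gradient falls out as `w - w'`.
    
    ~~~
    // Sketch only: recover the regularization gradient from the updater.
    val zeroGradient = Vectors.dense(new Array[Double](weights.size))
    val adjustedWeights =
      updater.compute(Vectors.fromBreeze(weights), zeroGradient, 1, 1, regParam)._1
    // regGradient(w) = w - w'
    val regGradient = weights - adjustedWeights.toBreeze.toDenseVector
    ~~~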


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40414081
  
    Merged build finished. 


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40032548
  
    Merged build started. 


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11464121
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    --- End diff --
    
    I added the comment in the code as:
        // GD converges way slower than L-BFGS. To achieve a 1% difference,
        // it requires 90 iterations in GD. No matter how much we increase
        // the number of iterations in GD here, lossGD will always be
        // larger than lossLBFGS.



---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459647
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    --- End diff --
    
    space after `:`
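    
    i.e. the declaration would read:
    
    ~~~
    var dataRDD: RDD[(Double, Vector)] = _
    ~~~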


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40032538
  
     Merged build triggered. 


---

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39899795
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13907/


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459633
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    --- End diff --
    
    Use `LocalSparkContext` to avoid dealing with `sc` setup directly. There is one in MLlib.
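    
    A minimal sketch of that change (the later revision of the suite quoted elsewhere in this thread does exactly this; `LocalSparkContext` supplies `sc` and handles setup and teardown):
    
    ~~~
    import org.apache.spark.mllib.util.LocalSparkContext
    
    class LBFGSSuite extends FunSuite with BeforeAndAfterAll
      with LocalSparkContext with ShouldMatchers {
      // `sc` comes from the LocalSparkContext trait; no beforeAll/afterAll plumbing.
      lazy val dataRDD = sc.parallelize(data, 2).cache()
    }
    ~~~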


---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11457976
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    --- End diff --
    
    mark `gradient` and `updater` private 
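
    That is, the constructor parameters would become:

        class LBFGS(private var gradient: Gradient, private var updater: Updater)
          extends Optimizer with Logging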


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11404515
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,251 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +    var i = 0
    +
    +    val costFun = new DiffFunction[BDV[Double]] {
    --- End diff --
    
    For the cost function, I intentionally did it this way because, inside the cost function, I want to access and modify variables defined outside it, for example `i` and `lossHistory`. If I created a private class for this, it would take extra effort to achieve that without changing Breeze's `DiffFunction` signature.
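
    A condensed sketch of the closure-based pattern described above (the `???` placeholders stand in for the actual mini-batch loss/gradient computation):

        var i = 0
        val lossHistory = new ArrayBuffer[Double](maxNumIterations)

        // The anonymous DiffFunction closes over and mutates `i` and `lossHistory`
        // in the enclosing scope, which a separate private class could not do as directly.
        val costFun = new DiffFunction[BDV[Double]] {
          override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
            i += 1
            val loss: Double = ???              // mini-batch loss (one Spark map-reduce)
            val gradientSum: BDV[Double] = ???  // summed mini-batch gradient
            lossHistory.append(loss)
            (loss, gradientSum)
          }
        }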


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40283050
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14077/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11571576
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convergenceTol)
    --- End diff --
    
    The argument names are not necessary here. Actually, the variable names tell you more than the argument names do.
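
    That is, assuming the positional order matches the named call above, this would become:

        val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](maxNumIterations, numCorrections, convergenceTol)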


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai closed the pull request at:

    https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434459
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40281922
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434626
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40281918
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458695
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    --- End diff --
    
    Add `override` before `def` so we know that it will inherit the docs.
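
    That is, the declaration becomes (body unchanged):

        override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector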


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39812368
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39895423
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11464013
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD: RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    --- End diff --
    
    It was a number off the top of my head, so I just did a quick comparison.
    Against 10 iterations of L-BFGS, SGD needs 40 iterations to get within a 2% difference,
    and 90 iterations to get within a 1% difference.
    In all of the tests, L-BFGS gives the smaller loss.
    As a result, you can see how slowly SGD converges when the number of iterations is high.
    Here I'll put 2% to make the test run faster.
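
    The assertion would then become something like:

        assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.02,
          "LBFGS should match GD result within 2% error.")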


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39804443
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13872/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11605070
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    --- End diff --
    
    There's still a character left before the line-length limit. ^_^


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11604731
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    --- End diff --
    
    https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
    I saw that we should use 4-space indentation. However, GradientDescent doesn't use the right indentation, which confused me. I will fix that as well.
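
    Under that rule, the parameter list above would be indented four spaces, along these lines:

        def runMiniBatchLBFGS(
            data: RDD[(Double, Vector)],
            gradient: Gradient,
            updater: Updater,
            numCorrections: Int,
            convergenceTol: Double,
            maxNumIterations: Int,
            regParam: Double,
            miniBatchFraction: Double,
            initialWeights: Vector): (Vector, Array[Double]) = {
          // body unchanged
        }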


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459992
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD: RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    --- End diff --
    
    Could you put a comment about the test here? Why is 0.8 a reasonable bound?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458258
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    --- End diff --
    
    Change `mcsrch` to `line search`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458103
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    --- End diff --
    
    `conv` is not a common acronym for `convergence`, so it's better to use the full name. However, `tol` is a common acronym for `tolerance`.
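    
    A minimal sketch of the suggested rename (this matches the naming a later revision of this diff adopts):
    
    ~~~
    private var convergenceTol: Double = 1e-4
    
    /** Set the convergence tolerance of iterations for L-BFGS. Default 1E-4. */
    def setConvergenceTol(tolerance: Double): this.type = {
      this.convergenceTol = tolerance
      this
    }
    ~~~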



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40452726
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40174782
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39899794
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11457867
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    --- End diff --
    
    Provide a reference for L-BFGS. Either the wikipedia page or the original paper should work.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40177788
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14052/



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434895
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459572
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce step in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a vector containing
    +   *         the weights for every feature, and the second element is an array containing
    +   *         the loss computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convTolerance)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  class CostFun(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    lossHistory: ArrayBuffer[Double],
    +    miniBatchSize: Double) extends DiffFunction[BDV[Double]] {
    +
    +    private var i = 0
    +
    +    def calculate(weights: BDV[Double]) = {
    +      // Take local copies so the closures do not capture this CostFun object, which is not serializable.
    +      val localData = data
    +      val localGradient = gradient
    +
    +      val (gradientSum, lossSum) = localData.sample(false, miniBatchFraction, 42 + i)
    +        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    +          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    +            val l = localGradient.compute(
    +              features, label, Vectors.fromBreeze(weights), Vectors.fromBreeze(grad))
    +            (grad, loss + l)
    +          },
    +          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    +            (grad1 += grad2, loss1 + loss2)
    +          })
    +
    +      /**
    +       * regVal is sum of sqrt of weights if it's L2 updater;
    +       * for other updater, the same logic is followed.
    +       */
    +      val regVal = updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
    +
    +      val loss = lossSum / miniBatchSize + regVal
    +      /**
    +       * This returns the gradient of the regularization term, computed via the updater.
    +       *
    +       * Given the input parameters, the updater basically does the following,
    +       *
    +       * w' = w - thisIterStepSize * (gradient + regGradient(w))
    +       * Note that regGradient is function of w
    +       *
    +       * If we set gradient = 0, thisIterStepSize = 1, then
    +       *
    +       * regGradient(w) = w - w'
    +       */
    +      val regGradient = weights - updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 1, 1, regParam)._1.toBreeze
    +
    +      // gradientTotal = gradientSum / miniBatchSize + regGradient
    +      val gradientTotal = (gradientSum :*= (1 / miniBatchSize)) :+= regGradient
    --- End diff --
    
    This is an AXPY operation. You can write
    
    ~~~
    // get gradient of the regularization term
    val gradient = weights - updater.compute ...
    // get gradient of the total loss
    brzAxpy(1.0 / miniBatchSize, gradientSum, gradient)
    ~~~
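    
    For reference, a self-contained sketch of what the in-place AXPY does, with toy values (the variable names here are illustrative, not from the PR):
    
    ~~~
    import breeze.linalg.{DenseVector => BDV, axpy}
    
    val gradientSum = BDV(2.0, 4.0, 6.0)  // summed sub-gradients from the mini batch
    val gradient = BDV(0.1, 0.2, 0.3)     // gradient of the regularization term
    val miniBatchSize = 2.0
    
    // axpy(a, x, y) computes y += a * x in place, avoiding a temporary vector.
    axpy(1.0 / miniBatchSize, gradientSum, gradient)
    // gradient is now BDV(1.1, 2.2, 3.3), i.e. gradientSum / miniBatchSize + the reg gradient.
    ~~~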



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459284
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce step in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a vector containing
    +   *         the weights for every feature, and the second element is an array containing
    +   *         the loss computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convTolerance)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  class CostFun(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    lossHistory: ArrayBuffer[Double],
    +    miniBatchSize: Double) extends DiffFunction[BDV[Double]] {
    +
    +    private var i = 0
    +
    +    def calculate(weights: BDV[Double]) = {
    +      // Take local copies so the closures do not capture this CostFun object, which is not serializable.
    +      val localData = data
    +      val localGradient = gradient
    +
    +      val (gradientSum, lossSum) = localData.sample(false, miniBatchFraction, 42 + i)
    +        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    +          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    +            val l = localGradient.compute(
    +              features, label, Vectors.fromBreeze(weights), Vectors.fromBreeze(grad))
    +            (grad, loss + l)
    +          },
    +          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    +            (grad1 += grad2, loss1 + loss2)
    +          })
    +
    +      /**
    +       * regVal is sum of sqrt of weights if it's L2 updater;
    --- End diff --
    
    `sqrt` means `square root`; you want `sum of weight squares`.
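    
    As a toy illustration (ignoring the regParam scaling the updater applies):
    
    ~~~
    val weights = Array(0.5, -2.0)
    val sumOfWeightSquares = weights.map(w => w * w).sum  // 0.25 + 4.0 = 4.25
    ~~~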



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11468997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,257 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce step in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a vector containing
    +   *         the weights for every feature, and the second element is an array containing
    +   *         the loss computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convergenceTol)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  /**
    +   * CostFun implements Breeze's DiffFunction[T], which returns the loss and gradient
    +   * at a particular point (weights). It's used in Breeze's convex optimization routines.
    +   */
    +  private class CostFun(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    lossHistory: ArrayBuffer[Double],
    +    miniBatchSize: Double) extends DiffFunction[BDV[Double]] {
    +
    +    private var i = 0
    +
    +    override def calculate(weights: BDV[Double]) = {
    +      // Take local copies so the closures do not capture this CostFun object, which is not serializable.
    +      val localData = data
    +      val localGradient = gradient
    +
    +      val (gradientSum, lossSum) = localData.sample(false, miniBatchFraction, 42 + i)
    +        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    +          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    +            val l = localGradient.compute(
    +              features, label, Vectors.fromBreeze(weights), Vectors.fromBreeze(grad))
    +            (grad, loss + l)
    +          },
    +          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    +            (grad1 += grad2, loss1 + loss2)
    +          })
    +
    +      /**
    +       * regVal is the sum of weight squares if it's the L2 updater;
    +       * for other updaters, the analogous logic applies.
    +       */
    +      val regVal = updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
    +
    +      val loss = lossSum / miniBatchSize + regVal
    +      /**
    +       * This returns the gradient of the regularization term, computed via the updater.
    +       *
    +       * Given the input parameters, the updater basically does the following,
    +       *
    +       * w' = w - thisIterStepSize * (gradient + regGradient(w))
    +       * Note that regGradient is function of w
    +       *
    +       * If we set gradient = 0, thisIterStepSize = 1, then
    +       *
    +       * regGradient(w) = w - w'
    +       *
    +       * TODO: We need to clean it up by separating the logic of regularization out
    +       *       from updater to regularizer.
    +       */
    +      val regGradient = weights - updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 1, 1, regParam)._1.toBreeze
    +
    +      // regGradient = gradientSum / miniBatchSize + regGradient
    +      axpy(1.0 / miniBatchSize, gradientSum, regGradient)
    +
    +      /**
    +       * NOTE: lossSum and loss are computed using the weights from the previous iteration,
    +       * and regVal is the regularization value computed in the previous iteration as well.
    +       */
    +      lossHistory.append(loss)
    +
    +      i += 1
    +
    +      (loss, regGradient)
    +    }
    +  }
    +
    --- End diff --
    
    remove empty line
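    
    Unrelated to the nit above: for anyone trying the patch, here is a hedged usage sketch of the builder-style API quoted in this diff (toy settings; `data` is assumed to be an RDD[(Double, Vector)] and `initialWeights` a Vector):
    
    ~~~
    val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
      .setNumCorrections(10)
      .setConvergenceTol(1e-4)
      .setMaxNumIterations(50)
      .setRegParam(0.1)
    
    val optimizedWeights = lbfgs.optimize(data, initialWeights)
    ~~~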



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11468959
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,209 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  var convergenceTol = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    // This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should
    +    // have at least the same performance. It's based on observation, not theoretically guaranteed.
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // GD converges much more slowly than L-BFGS. To achieve a 1% difference,
    +    // GD requires about 90 iterations. No matter how much we increase
    +    // the number of iterations in GD here, lossGD will always be
    +    // larger than lossLBFGS. This is based on observation, not theoretically guaranteed.
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.02,
    +      "LBFGS should match GD result within 2% difference.")
    +  }
    +
    +  test("LBFGS and Gradient Descent with L2 regularization should get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    val numGDIterations = 50
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    // The 2% difference here is based on observation, but is not theoretically guaranteed.
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.02),
    +      "The last losses of LBFGS and GD should be within 2% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.02) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.02),
    +      "The weight differences between LBFGS and GD should be within 2%.")
    +  }
    +
    +  test("The convergence criteria should work as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set convergenceTol to 0.0, so that the algorithm will
    +     * run up to maxNumIterations, which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    maxNumIterations = 8
    +    convergenceTol = 0
    +
    +    val (_, lossLBFGS1) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Note that the first loss is computed with the initial weights,
    +    // so the total number of losses will be the number of iterations + 1.
    +    assert(lossLBFGS1.length == 9)
    +
    +    convergenceTol = 0.1
    +    val (_, lossLBFGS2) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Based on observation, lossLBFGS2 runs 3 iterations; this is not theoretically guaranteed.
    +    assert(lossLBFGS2.length == 4)
    +    assert((lossLBFGS2(2) - lossLBFGS2(3)) / lossLBFGS2(2) < convergenceTol)
    +
    +    convergenceTol = 0.01
    +    val (_, lossLBFGS3) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With smaller convergenceTol, it takes more steps.
    +    assert(lossLBFGS3.length > lossLBFGS2.length)
    +
    +    // Based on observation, lossLBFGS3 runs 5 iterations; this is not theoretically guaranteed.
    +    assert(lossLBFGS3.length == 6)
    +    assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < convergenceTol)
    +  }
    +}
    +
    --- End diff --
    
    remove empty line.
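    
    Also, for readers of the convergence test above, a minimal sketch of the relative-decrease check it asserts (Breeze's actual stopping rule may differ in detail):
    
    ~~~
    def converged(prevLoss: Double, loss: Double, tol: Double): Boolean =
      (prevLoss - loss) / prevLoss < tol
    
    converged(prevLoss = 0.60, loss = 0.57, tol = 0.1)  // true: 5% relative decrease < 10%
    converged(prevLoss = 0.60, loss = 0.30, tol = 0.1)  // false: a 50% decrease means still improving
    ~~~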



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458730
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    --- End diff --
    
    Use JavaDoc.
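    
    i.e., something like (illustrative only):
    
    ~~~
    /**
     * Top-level method to run LBFGS.
     */
    object LBFGS extends Logging {
      // ...
    }
    ~~~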



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460344
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD: RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.05) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.05),
    +      "The weight differences between LBFGS and GD should be within 5% difference.")
    +  }
    +
    +  test("Test if the convergence criteria works as we expect.") {
    --- End diff --
    
    Ditto. Remove "Test if "


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40439479
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14126/


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40250899
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14062/


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40452733
  
    Merged build started. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460419
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.05) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.05),
    +      "The weight differences between LBFGS and GD should be within 5% difference.")
    +  }
    +
    +  test("Test if the convergence criteria works as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set the convTolerance to 0.0, so that the algorithm will
    +     * run up to the maxNumIterations which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    maxNumIterations = 8
    +    convTolerance = 0
    +
    +    val (_, lossLBFGS1) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Note that the first loss is computed with the initial weights,
    +    // so the total number of loss values will be the number of iterations + 1.
    +    assert(lossLBFGS1.length == 9)
    +
    +    convTolerance = 0.1
    +    val (_, lossLBFGS2) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS2.length == 4)
    --- End diff --
    
    Ditto. Add a comment here saying `4` is observed instead of theoretically derived.
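
    For illustration, a minimal sketch of such a comment (the wording is illustrative, not from the PR):

        // The expected length `4` is observed empirically for this data set and
        // convergence tolerance; it is not theoretically derived.
        assert(lossLBFGS2.length == 4)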


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai closed the pull request at:

    https://github.com/apache/spark/pull/353


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11463225
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    --- End diff --
    
    No, it is not necessary to do it in this PR.


[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39804441
  
    Merged build finished. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40174791
  
    Merged build started. 


[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39805202
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13873/


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40429276
  
    Merged build started. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40252714
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14064/


[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39821081
  
    @dbtsai Did you compare L-BFGS with MLlib's implementation of GD on some real data sets?


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11571545
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset is performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    --- End diff --
    
    Should use 4-space indentation.
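
    For example, a sketch of the declaration with the continuation lines re-indented to 4 spaces (parameters as quoted above; the body is unchanged):

        def runMiniBatchLBFGS(
            data: RDD[(Double, Vector)],
            gradient: Gradient,
            updater: Updater,
            numCorrections: Int,
            convergenceTol: Double,
            maxNumIterations: Int,
            regParam: Double,
            miniBatchFraction: Double,
            initialWeights: Vector): (Vector, Array[Double]) = {
          // body unchanged
        }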


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40035143
  
    Merged build finished. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434691
  
    Timeout for the latest Jenkins run. It seems that CI is not stable right now.


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459772
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add a extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    --- End diff --
    
    remove "Assert" so it reads "test LBFGS loss is ..."


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40283049
  
    Merged build finished. All automated tests passed.


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11468948
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,209 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  var convergenceTol = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    // This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should
    +    // at least have the same performance. It's based on observation, not theoretically guaranteed.
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    --- End diff --
    
    @dbtsai This test still looks strange, because both gradient descent and L-BFGS are monotonic if the line search is correctly implemented. In our gradient descent implementation, step sizes are pre-defined, so the loss may not be monotonic if the initial step size is too large. However, L-BFGS uses line search, so maybe you should test that the sequence of losses is strictly decreasing instead of only 80 percent of them.
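
    A minimal sketch of the stricter assertion (illustrative code, not from the PR):

        // L-BFGS performs a line search, so every accepted step should reduce the loss.
        val lossDiff = loss.init.zip(loss.tail).map { case (lhs, rhs) => lhs - rhs }
        assert(lossDiff.forall(_ > 0), "loss should be strictly decreasing.")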


[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39805201
  
    Merged build finished. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434890
  
     Merged build triggered. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11571580
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset is performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convergenceTol)
    --- End diff --
    
    Also import `LBFGS` at the top of the file. Rename it to `brzLBFGS` to avoid name collision.
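
    A sketch of the suggested change (the alias `brzLBFGS` follows the suggestion above; the existing imports are kept):

        import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => brzLBFGS}

        // later, inside runMiniBatchLBFGS:
        val lbfgs = new brzLBFGS[BDV[Double]](
          maxIter = maxNumIterations, m = numCorrections, tolerance = convergenceTol)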


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434555
  
    Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40174856
  
    Merged build finished. 


[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11379843
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,251 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in the mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set it to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Double): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind or regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling, and averaging the subgradients over this subset is performed using one standard
    +   * spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +    var i = 0
    +
    +    val costFun = new DiffFunction[BDV[Double]] {
    --- End diff --
    
    Better create a private class for the cost function.
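
    For instance, a minimal sketch of such a class (the constructor parameters follow the `new CostFun(...)` call site quoted elsewhere in this thread; `calculate` is the method required by Breeze's `DiffFunction`, and the body is elided):

        private class CostFun(
            data: RDD[(Double, Vector)],
            gradient: Gradient,
            updater: Updater,
            regParam: Double,
            miniBatchFraction: Double,
            lossHistory: ArrayBuffer[Double],
            miniBatchSize: Double) extends DiffFunction[BDV[Double]] {

          override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
            // Sample a miniBatchFraction of `data`, aggregate the per-example
            // (gradient, loss) contributions, fold in the regularization via
            // `updater`, record the loss in `lossHistory`, and return the pair.
            ???
          }
        }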


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40252713
  
    Merged build finished. 


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458457
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set it to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    --- End diff --
    
    Is `lineSearchTolerance` really used somewhere? Breeze uses fixed constants for line search.
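    
    For reference, Breeze's L-BFGS is configured only through the knobs below (matching the constructor call in the revised diff); there is no line-search tolerance parameter, so the setter above is effectively dead code:
    
        import breeze.linalg.{DenseVector => BDV}
    
        // maxIter: iteration cap, m: history size, tolerance: convergence
        // tolerance. Line-search constants are fixed inside Breeze.
        val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
          maxIter = 100, m = 10, tolerance = 1e-4)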



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11521070
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,209 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  var convergenceTol = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    // This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should
    +    // at least match that performance. It's based on observation, not a theoretical guarantee.
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    --- End diff --
    
    You are right.  Since the cost function is convex, the loss is guaranteed to decrease monotonically with the L-BFGS optimizer. (SGD doesn't guarantee this; the loss may fluctuate during optimization.) Will add a test for this property.
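    
    A sketch of that check, assuming `loss` is the loss-history array returned by runMiniBatchLBFGS:
    
        // For a convex cost function every L-BFGS step should reduce the
        // loss, so all pairwise differences must be strictly positive.
        val lossDiff = loss.init.zip(loss.tail).map { case (lhs, rhs) => lhs - rhs }
        assert(lossDiff.forall(_ > 0), "loss should be monotonically decreasing.")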




[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40177723
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11528081
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +
    +    val costFun = new CostFun(
    +      data, gradient, updater, regParam, miniBatchFraction, lossHistory, miniBatchSize)
    +
    +    val lbfgs = new breeze.optimize.LBFGS[BDV[Double]](
    +      maxIter = maxNumIterations, m = numCorrections, tolerance = convergenceTol)
    +
    +    val weights = Vectors.fromBreeze(
    +      lbfgs.minimize(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector))
    +
    +    logInfo("LBFGS.runMiniBatchSGD finished. Last 10 losses %s".format(
    +      lossHistory.takeRight(10).mkString(", ")))
    +
    +    (weights, lossHistory.toArray)
    +  }
    +
    +  /**
    +   * CostFun implements Breeze's DiffFunction[T], which returns the loss and gradient
    +   * at a particular point (weights). It's used in Breeze's convex optimization routines.
    +   */
    +  private class CostFun(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    lossHistory: ArrayBuffer[Double],
    +    miniBatchSize: Double) extends DiffFunction[BDV[Double]] {
    +
    +    private var i = 0
    +
    +    override def calculate(weights: BDV[Double]) = {
    +      // Take local copies to avoid serializing the CostFun object, which is not serializable.
    +      val localData = data
    +      val localGradient = gradient
    +
    +      val (gradientSum, lossSum) = localData.sample(false, miniBatchFraction, 42 + i)
    +        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    +          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    +            val l = localGradient.compute(
    +              features, label, Vectors.fromBreeze(weights), Vectors.fromBreeze(grad))
    +            (grad, loss + l)
    +          },
    +          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    +            (grad1 += grad2, loss1 + loss2)
    +          })
    +
    +      /**
    +       * regVal is the regularization value, e.g. the squared-weights penalty
    +       * for the L2 updater; other updaters follow the same logic.
    +       */
    +      val regVal = updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
    +
    +      val loss = lossSum / miniBatchSize + regVal
    +      /**
    +       * The following obtains the regularization part of the gradient from the updater.
    +       *
    +       * Given the input parameters, the updater basically does the following,
    +       *
    +       * w' = w - thisIterStepSize * (gradient + regGradient(w))
    +       * Note that regGradient is a function of w.
    +       *
    +       * If we set gradient = 0, thisIterStepSize = 1, then
    +       *
    +       * regGradient(w) = w - w'
    +       *
    +       * TODO: We need to clean this up by separating the regularization logic
    +       *       out of the updater into a regularizer.
    +       */
    +      // The following gradientTotal is actually the regularization part of gradient.
    +      // Will add the gradientSum computed from the data with weights in the next step.
    +      val gradientTotal = weights - updater.compute(
    +        Vectors.fromBreeze(weights),
    +        Vectors.dense(new Array[Double](weights.size)), 1, 1, regParam)._1.toBreeze
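    +
    +      // Concretely, with the squared L2 updater and thisIterStepSize = 1,
    +      // w' = w * (1 - regParam), so w - w' = regParam * w, which is the
    +      // gradient of the L2 penalty at w. (Illustrative; other updaters
    +      // yield their own regGradient.)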
    +
    +      // gradientTotal = gradientSum / miniBatchSize + gradientTotal
    +      axpy(1.0 / miniBatchSize, gradientSum, gradientTotal)
    +
    +      /**
    +       * NOTE: lossSum and loss are computed using the weights from the previous iteration
    +       * and regVal is the regularization value computed in the previous iteration as well.
    +       */
    +      lossHistory.append(loss)
    +
    +      i += 1
    +
    +      (loss, gradientTotal)
    +    }
    +  }
    +
    +}
    --- End diff --
    
    You need exactly one newline character at the end of the file. There were two before, and now there are none.
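    
    Aside: a usage sketch of the fluent API in this diff, borrowing `dataRDD` and `initialWeightsWithIntercept` from the test suite elsewhere in this thread (parameter values are illustrative):
    
        // Configure the optimizer through its chained setters, then run it.
        val optimizer = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
          .setNumCorrections(10)
          .setConvergenceTol(1e-4)
          .setMaxNumIterations(50)
          .setRegParam(0.1)
    
        val weights = optimizer.optimize(dataRDD, initialWeightsWithIntercept)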



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39812370
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13877/



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39804336
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460048
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD: RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    --- End diff --
    
    ditto. Remove "Assert that "



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11461398
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set it to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    --- End diff --
    
    Good catch! It was used in the RISO implementation. I'll just remove them. Thanks.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460767
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    --- End diff --
    
    @mengxr  
    I know. I pretty much followed the existing coding style in GradientDescent.scala.
    Should I also change the ones in the other places?



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11605030
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * A smaller value will lead to higher accuracy at the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Double): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible for performing the update from the regularization term as well,
    +   * and therefore determines what kind of regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling and averaging the subgradients over this subset are performed using one standard
    +   * Spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    --- End diff --
    
    Copied from GradientDescent.scala. Fixed in both places in the next commit.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40439478
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39895404
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11458182
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,263 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    --- End diff --
    
    `m` is not defined.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40035145
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13972/



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40250806
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
GitHub user dbtsai reopened a pull request:

    https://github.com/apache/spark/pull/353

    [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.

    This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr !
    
    When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularized part of the cost function); with the current updater design, we can compute those two values in the following way.
    
    Let's review how the updater works when returning newWeights, given the input parameters.
    
    w' = w - thisIterStepSize * (gradient + regGradient(w))  Note that regGradient is a function of w!
    If we set gradient = 0, thisIterStepSize = 1, then
    regGradient(w) = w - w'
    
    As a result, regVal can be computed by 
    
        val regVal = updater.compute(
          weights,
          new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
    and regGradient can be obtained by
    
          val regGradient = weights.sub(
            updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
    
    The PR includes tests that compare the results with SGD, with and without regularization.
    
    We did a comparison between LBFGS and SGD, and often saw 10x fewer
    steps with LBFGS while the per-step cost is the same (just computing
    the gradient).
    
    The following is a paper from Prof. Ng's group at Stanford comparing different
    optimizers, including LBFGS and SGD. They use them in the context of
    deep learning, but it is worth reading as a reference.
    http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #353
    
----
commit 984b18e21396eae84656e15da3539ff3b5f3bf4a
Author: DB Tsai <db...@alpinenow.com>
Date:   2014-04-05T00:06:50Z

    L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.

----



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40252639
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460320
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD: RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.05) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.05),
    +      "The weight differences between LBFGS and GD should be within 5% difference.")
    --- End diff --
    
    Ditto.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40458324
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14142/



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40281879
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40458322
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11464280
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD: RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add a extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    --- End diff --
    
    Yeah, it's based on my observation. It can be 2% here. By increasing the number of GD iterations, I can achieve 1% or 0.5%. Will document it.
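    
    Probably something along these lines (the 2% bound is empirical, not a theoretical guarantee):
    
        // The 2% difference here is based on observation, but is not theoretically guaranteed.
        assert(compareDouble(lossGD.last, lossLBFGS.last, 0.02),
          "The last losses of LBFGS and GD should be within 2% difference.")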



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460436
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add a extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.05) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.05),
    +      "The weight differences between LBFGS and GD should be within 5% difference.")
    +  }
    +
    +  test("Test if the convergence criteria works as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set the convTolerance to 0.0, so that the algorithm will
    +     * run up to the maxNumIterations which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    maxNumIterations = 8
    +    convTolerance = 0
    +
    +    val (_, lossLBFGS1) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Note that the first loss is computed with initial weights,
    +    // so the total numbers of loss will be numbers of iterations + 1
    +    assert(lossLBFGS1.length == 9)
    +
    +    convTolerance = 0.1
    +    val (_, lossLBFGS2) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS2.length == 4)
    +    assert((lossLBFGS2(2) - lossLBFGS2(3)) / lossLBFGS2(2) < convTolerance)
    +
    +    // With smaller convTolerance, it takes more steps.
    +    convTolerance = 0.01
    +    val (_, lossLBFGS3) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS3.length == 6)
    +    assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < convTolerance)
    +  }
    +}
    +
    --- End diff --
    
    remove extra empty line



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11459694
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add a extra variable consisting of all 1.0's for the intercept.
    --- End diff --
    
    an extra
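    
    i.e.
    
        // Add an extra variable consisting of all 1.0's for the intercept.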



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11460449
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,217 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
    +  @transient private var sc: SparkContext = _
    +  var dataRDD:RDD[(Double, Vector)] = _
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  val lineSearchTolerance = 0.9
    +  var convTolerance = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add a extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  override def beforeAll() {
    +    sc = new SparkContext("local", "test")
    +    dataRDD = sc.parallelize(data, 2).cache()
    +  }
    +
    +  override def afterAll() {
    +    sc.stop()
    +    System.clearProperty("spark.driver.port")
    +  }
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
    +      "LBFGS should match GD result within 5% error.")
    +  }
    +
    +  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // With regularization, GD converges faster now!
    +    // So we only need 20 iterations to get the same result.
    +    val numGDIterations = 20
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.05),
    +      "The last losses of LBFGS and GD should be within 5% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.05) &&
    +        compareDouble(weightLBFGS(1), weightGD(1), 0.05),
    +      "The weight differences between LBFGS and GD should be within 5% difference.")
    +  }
    +
    +  test("Test if the convergence criteria works as we expect.") {
    +    val regParam = 0.0
    +
    +    /**
    +     * For the first run, we set the convTolerance to 0.0, so that the algorithm will
    +     * run up to the maxNumIterations which is 8 here.
    +     */
    +    val initialWeightsWithIntercept = Vectors.dense(0.0, 0.0)
    +    maxNumIterations = 8
    +    convTolerance = 0
    +
    +    val (_, lossLBFGS1) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // Note that the first loss is computed with initial weights,
    +    // so the total numbers of loss will be numbers of iterations + 1
    +    assert(lossLBFGS1.length == 9)
    +
    +    convTolerance = 0.1
    +    val (_, lossLBFGS2) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS2.length == 4)
    +    assert((lossLBFGS2(2) - lossLBFGS2(3)) / lossLBFGS2(2) < convTolerance)
    +
    +    // With smaller convTolerance, it takes more steps.
    +    convTolerance = 0.01
    +    val (_, lossLBFGS3) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(lossLBFGS3.length == 6)
    --- End diff --
    
    Ditto.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11571555
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,259 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV, axpy}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(private var gradient: Gradient, private var updater: Updater)
    +  extends Optimizer with Logging {
    +
    +  private var numCorrections = 10
    +  private var convergenceTol = 1E-4
    +  private var maxNumIterations = 100
    +  private var regParam = 0.0
    +  private var miniBatchFraction = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of numCorrections less than 3 are not recommended; large values
    +   * of numCorrections will result in excessive computing time.
    +   * 3 < numCorrections < 10 is recommended.
    +   * Restriction: numCorrections > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvergenceTol(tolerance: Int): this.type = {
    +    this.convergenceTol = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind or regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +/**
    + * Top-level method to run LBFGS.
    + */
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling, and averaging the subgradients over this subset is performed using one standard
    +   * spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param convergenceTol - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    convergenceTol: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    --- End diff --
    
    Use `numExamples` instead? `: Long` is not necessary.
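    
    For example:
    
        val numExamples = data.count()
        val miniBatchSize = numExamples * miniBatchFraction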



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-39804838
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/353#issuecomment-40434460
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14122/



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11468969
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
    @@ -0,0 +1,209 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import org.scalatest.BeforeAndAfterAll
    +import org.scalatest.FunSuite
    +import org.scalatest.matchers.ShouldMatchers
    +
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +
    +class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
    +
    +  val nPoints = 10000
    +  val A = 2.0
    +  val B = -1.5
    +
    +  val initialB = -1.0
    +  val initialWeights = Array(initialB)
    +
    +  val gradient = new LogisticGradient()
    +  val numCorrections = 10
    +  var convergenceTol = 1e-12
    +  var maxNumIterations = 10
    +  val miniBatchFrac = 1.0
    +
    +  val simpleUpdater = new SimpleUpdater()
    +  val squaredL2Updater = new SquaredL2Updater()
    +
    +  // Add an extra variable consisting of all 1.0's for the intercept.
    +  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +  val data = testData.map { case LabeledPoint(label, features) =>
    +    label -> Vectors.dense(1.0, features.toArray: _*)
    +  }
    +
    +  lazy val dataRDD = sc.parallelize(data, 2).cache()
    +
    +  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
    +    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
    +  }
    +
    +  test("LBFGS loss should be decreasing and match the result of Gradient Descent.") {
    +    val updater = new SimpleUpdater()
    +    val regParam = 0
    +
    +    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
    +
    +    val (_, loss) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
    +
    +    val lossDiff = loss.init.zip(loss.tail).map {
    +      case (lhs, rhs) => lhs - rhs
    +    }
    +    // This 0.8 bound is copying from GradientDescentSuite, and L-BFGS should
    +    // at least have the same performance. It's based on observation, no theoretically guaranteed.
    +    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
    +
    +    val stepSize = 1.0
    +    // Well, GD converges slower, so it requires more iterations!
    +    val numGDIterations = 50
    +    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    // GD converges a way slower than L-BFGS. To achieve 1% difference,
    +    // it requires 90 iterations in GD. No matter how hard we increase
    +    // the number of iterations in GD here, the lossGD will be always
    +    // larger than lossLBFGS. This is based on observation, no theoretically guaranteed
    +    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.02,
    +      "LBFGS should match GD result within 2% difference.")
    +  }
    +
    +  test("LBFGS and Gradient Descent with L2 regularization should get the same result.") {
    +    val regParam = 0.2
    +
    +    // Prepare another non-zero weights to compare the loss in the first iteration.
    +    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
    +
    +    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      numCorrections,
    +      convergenceTol,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    val numGDIterations = 50
    +    val stepSize = 1.0
    +    val (weightGD, lossGD) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      squaredL2Updater,
    +      stepSize,
    +      numGDIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept)
    +
    +    assert(compareDouble(lossGD(0), lossLBFGS(0)),
    +      "The first losses of LBFGS and GD should be the same.")
    +
    +    // The 2% difference here is based on observation, but is not theoretically guaranteed.
    +    assert(compareDouble(lossGD.last, lossLBFGS.last, 0.02),
    +      "The last losses of LBFGS and GD should be within 2% difference.")
    +
    +    assert(
    +      compareDouble(weightLBFGS(0), weightGD(0), 0.02) &&
    --- End diff --
    
    This should fit on the line above.
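    
    I.e., roughly (assuming the combined line stays within the style limit):
    
        assert(compareDouble(weightLBFGS(0), weightGD(0), 0.02) &&
          compareDouble(weightLBFGS(1), weightGD(1), 0.02),
          "The weight differences between LBFGS and GD should be within 2% difference.")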



[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11404094
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
    @@ -0,0 +1,251 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.optimization
    +
    +import scala.Array
    +import scala.collection.mutable.ArrayBuffer
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import breeze.optimize.{CachedDiffFunction, DiffFunction}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +
    +/**
    + * Class used to solve an optimization problem using Limited-memory BFGS.
    + * @param gradient Gradient function to be used.
    + * @param updater Updater to be used to update weights after every iteration.
    + */
    +class LBFGS(var gradient: Gradient, var updater: Updater)
    +  extends Optimizer with Logging
    +{
    +  private var numCorrections: Int = 10
    +  private var lineSearchTolerance: Double = 0.9
    +  private var convTolerance: Double = 1E-4
    +  private var maxNumIterations: Int = 100
    +  private var regParam: Double = 0.0
    +  private var miniBatchFraction: Double = 1.0
    +
    +  /**
    +   * Set the number of corrections used in the LBFGS update. Default 10.
    +   * Values of m less than 3 are not recommended; large values of m
    +   * will result in excessive computing time. 3 < m < 10 is recommended.
    +   * Restriction: m > 0
    +   */
    +  def setNumCorrections(corrections: Int): this.type = {
    +    assert(corrections > 0)
    +    this.numCorrections = corrections
    +    this
    +  }
    +
    +  /**
    +   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
    +   * If the function and gradient evaluations are inexpensive with respect to the cost of
    +   * the iteration (which is sometimes the case when solving very large problems) it may
    +   * be advantageous to set to a small value. A typical small value is 0.1.
    +   * Restriction: should be greater than 1e-4.
    +   */
    +  def setLineSearchTolerance(tolerance: Double): this.type = {
    +    this.lineSearchTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
    +   */
    +  def setMiniBatchFraction(fraction: Double): this.type = {
    +    this.miniBatchFraction = fraction
    +    this
    +  }
    +
    +  /**
    +   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
    +   * Smaller value will lead to higher accuracy with the cost of more iterations.
    +   */
    +  def setConvTolerance(tolerance: Int): this.type = {
    +    this.convTolerance = tolerance
    +    this
    +  }
    +
    +  /**
    +   * Set the maximal number of iterations for L-BFGS. Default 100.
    +   */
    +  def setMaxNumIterations(iters: Int): this.type = {
    +    this.maxNumIterations = iters
    +    this
    +  }
    +
    +  /**
    +   * Set the regularization parameter. Default 0.0.
    +   */
    +  def setRegParam(regParam: Double): this.type = {
    +    this.regParam = regParam
    +    this
    +  }
    +
    +  /**
    +   * Set the gradient function (of the loss function of one single data example)
    +   * to be used for L-BFGS.
    +   */
    +  def setGradient(gradient: Gradient): this.type = {
    +    this.gradient = gradient
    +    this
    +  }
    +
    +  /**
    +   * Set the updater function to actually perform a gradient step in a given direction.
    +   * The updater is responsible to perform the update from the regularization term as well,
    +   * and therefore determines what kind or regularization is used, if any.
    +   */
    +  def setUpdater(updater: Updater): this.type = {
    +    this.updater = updater
    +    this
    +  }
    +
    +  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    +    val (weights, _) = LBFGS.runMiniBatchLBFGS(
    +      data,
    +      gradient,
    +      updater,
    +      numCorrections,
    +      lineSearchTolerance,
    +      convTolerance,
    +      maxNumIterations,
    +      regParam,
    +      miniBatchFraction,
    +      initialWeights)
    +    weights
    +  }
    +
    +}
    +
    +// Top-level method to run LBFGS.
    +object LBFGS extends Logging {
    +  /**
    +   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
    +   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    +   * in order to compute a gradient estimate.
    +   * Sampling, and averaging the subgradients over this subset is performed using one standard
    +   * spark map-reduce in each iteration.
    +   *
    +   * @param data - Input data for L-BFGS. RDD of the set of data examples, each of
    +   *               the form (label, [feature values]).
    +   * @param gradient - Gradient object (used to compute the gradient of the loss function of
    +   *                   one single data example)
    +   * @param updater - Updater function to actually perform a gradient step in a given direction.
    +   * @param numCorrections - The number of corrections used in the L-BFGS update.
    +   * @param lineSearchTolerance - The tolerance to control the accuracy of the line search.
    +   * @param convTolerance - The convergence tolerance of iterations for L-BFGS
    +   * @param maxNumIterations - Maximal number of iterations that L-BFGS can be run.
    +   * @param regParam - Regularization parameter
    +   * @param miniBatchFraction - Fraction of the input data set that should be used for
    +   *                          one iteration of L-BFGS. Default value 1.0.
    +   *
    +   * @return A tuple containing two elements. The first element is a column matrix containing
    +   *         weights for every feature, and the second element is an array containing the loss
    +   *         computed for every iteration.
    +   */
    +  def runMiniBatchLBFGS(
    +    data: RDD[(Double, Vector)],
    +    gradient: Gradient,
    +    updater: Updater,
    +    numCorrections: Int,
    +    lineSearchTolerance: Double,
    +    convTolerance: Double,
    +    maxNumIterations: Int,
    +    regParam: Double,
    +    miniBatchFraction: Double,
    +    initialWeights: Vector): (Vector, Array[Double]) = {
    +
    +    val lossHistory = new ArrayBuffer[Double](maxNumIterations)
    +
    +    val nexamples: Long = data.count()
    +    val miniBatchSize = nexamples * miniBatchFraction
    +    var i = 0
    +
    +    val costFun = new DiffFunction[BDV[Double]] {
    --- End diff --
    
    I tested the optimizer with several real datasets: small ones from the UCI Machine Learning Repository, and some big ones like mnist8m (although the properties and stability of the optimizer don't depend on the size of the dataset). L-BFGS gives the same or a better result compared with GD. For some datasets, GD converges really slowly after 40~50 iterations.

