Posted to reviews@spark.apache.org by Lewuathe <gi...@git.apache.org> on 2014/12/08 14:22:27 UTC

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

GitHub user Lewuathe opened a pull request:

    https://github.com/apache/spark/pull/3636

    [SPARK-3382] GradientDescent convergence tolerance

    GradientDescent can receive a convergence tolerance value. The default value is 0.0.
    When the loss value becomes less than the tolerance set by the user, iteration is terminated.
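
    For illustration, a minimal sketch of how the new setter can be reached
    through one of the SGD-based models (setConvergenceTolerance is the name
    this patch proposes; the exact name may change in review):

    ```scala
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val lr = new LogisticRegressionWithSGD()
    // optimizer is a GradientDescent, so the new setter chains with the
    // existing ones.
    lr.optimizer
      .setNumIterations(100)
      .setConvergenceTolerance(0.001)
    ```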

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Lewuathe/spark gd-convergence-tolerance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3636.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3636
    
----
commit 5433f71a3822b0fb16b910f64dc53ede8d539ebe
Author: lewuathe <le...@me.com>
Date:   2014-12-08T13:19:21Z

    [SPARK-3382] GradientDescent convergence tolerance

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21480907
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -182,34 +195,38 @@ object GradientDescent extends Logging {
         var regVal = updater.compute(
           weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
     
    -    for (i <- 1 to numIterations) {
    -      val bcWeights = data.context.broadcast(weights)
    -      // Sample a subset (fraction miniBatchFraction) of the total data
    -      // compute and sum up the subgradients on this subset (this is one map-reduce)
    -      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    -        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    -          seqOp = (c, v) => {
    -            // c: (grad, loss, count), v: (label, features)
    -            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    -            (c._1, c._2 + l, c._3 + 1)
    -          },
    -          combOp = (c1, c2) => {
    -            // c: (grad, loss, count)
    -            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    -          })
    -
    -      if (miniBatchSize > 0) {
    -        /**
    -         * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    -         * and regVal is the regularization value computed in the previous iteration as well.
    -         */
    -        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    -        val update = updater.compute(
    -          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    -        weights = update._1
    -        regVal = update._2
    -      } else {
    -        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    +    val b = new Breaks
    +    b.breakable {
    +      for (i <- 1 to numIterations) {
    +        val bcWeights = data.context.broadcast(weights)
    +        // Sample a subset (fraction miniBatchFraction) of the total data
    +        // compute and sum up the subgradients on this subset (this is one map-reduce)
    +        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    +          .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    +            seqOp = (c, v) => {
    +              // c: (grad, loss, count), v: (label, features)
    +              val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    +              (c._1, c._2 + l, c._3 + 1)
    +            },
    +            combOp = (c1, c2) => {
    +              // c: (grad, loss, count)
    +              (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    +            })
    +
    +        if (miniBatchSize > 0) {
    +          /**
    +           * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    +           * and regVal is the regularization value computed in the previous iteration as well.
    +           */
    +          stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    +          val update = updater.compute(
    +            weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    +          weights = update._1
    +          regVal = update._2
    +          if (stochasticLossHistory.last < convergenceTolerance) b.break
    --- End diff --
    
    This is comparing convergenceTolerance with the objective from the last iteration.  It should compare with the absolute value of the difference between the objective from the last iteration and the objective from the iteration before that.
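
    A sketch of the suggested check, reusing names from the diff above
    (stochasticLossHistory is the buffer of per-iteration objectives; the
    fragment is meant to replace the flagged line, not stand alone):

    ```scala
    // Compare the change between consecutive objectives, not the latest
    // objective itself, against the tolerance.
    if (stochasticLossHistory.length > 1) {
      val lastLoss = stochasticLossHistory.last
      val previousLoss = stochasticLossHistory(stochasticLossHistory.length - 2)
      if (math.abs(lastLoss - previousLoss) < convergenceTolerance) b.break
    }
    ```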


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-96752804
  
    Thanks for pinging!  I'll make a pass now


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66588539
  
    @jkbradley I updated the PR. Could you check it?


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21480867
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -39,6 +41,7 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       private var numIterations: Int = 100
       private var regParam: Double = 0.0
       private var miniBatchFraction: Double = 1.0
    +  private var convergenceTolerance: Double = 0.0
    --- End diff --
    
    I feel like the default should be > 0.0.  Something small like 0.001 (a value pulled from libsvm [https://github.com/cjlin1/libsvm/blob/master/python/svm.py]) might be reasonable.  Basically, I think that convergence tolerance is generally a better stopping criterion than numIterations, and having it > 0.0 will give it a chance of taking effect before numIterations.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21556494
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala ---
    @@ -138,6 +138,45 @@ class GradientDescentSuite extends FunSuite with MLlibTestSparkContext with Matc
           "The different between newWeights with/without regularization " +
             "should be initialWeightsWithIntercept.")
       }
    +
    +  test("iteration should end with convergence tolerance") {
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val initialB = -1.0
    +    val initialWeights = Array(initialB)
    +
    +    val gradient = new LogisticGradient()
    +    val updater = new SimpleUpdater()
    +    val stepSize = 1.0
    +    val numIterations = 10
    +    val regParam = 0
    +    val miniBatchFrac = 1.0
    +    val convergenceTolerance = 5.0e-1
    +
    +    // Add a extra variable consisting of all 1.0's for the intercept.
    +    val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +    val data = testData.map { case LabeledPoint(label, features) =>
    +      label -> Vectors.dense(1.0 +: features.toArray)
    +    }
    +
    +    val dataRDD = sc.parallelize(data, 2).cache()
    +    val initialWeightsWithIntercept = Vectors.dense(1.0 +: initialWeights.toArray)
    +
    +    val (_, loss) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept,
    +      convergenceTolerance)
    +
    +    assert(loss.length < numIterations, "doesn't satisfy convergence tolerance")
    --- End diff --
    
    "doesn't satisfy convergence tolerance" --> "convergenceTolerance failed to stop optimization early"


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-78716518
  
    @sryza Sure


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21480909
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -219,4 +236,17 @@ object GradientDescent extends Logging {
         (weights, stochasticLossHistory.toArray)
     
       }
    +
    +  def runMiniBatchSGD(
    --- End diff --
    
    It is odd to have an API with 2 different argument orders.  Can this please be fixed in 1 of these 2 ways:
    (1) Keep the old argument order, and have convergenceTolerance come after initialWeights.
    (2) Remove this old method call completely, and update the code base where relevant.
    I vote for (1) for consistency.
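
    A sketch of option (1): keep the old signature as an overload that forwards
    to the new one, with convergenceTolerance appended after initialWeights
    (0.001 here is the default that later revisions of the patch settle on):

    ```scala
    def runMiniBatchSGD(
        data: RDD[(Double, Vector)],
        gradient: Gradient,
        updater: Updater,
        stepSize: Double,
        numIterations: Int,
        regParam: Double,
        miniBatchFraction: Double,
        initialWeights: Vector): (Vector, Array[Double]) =
      runMiniBatchSGD(data, gradient, updater, stepSize, numIterations,
        regParam, miniBatchFraction, initialWeights, 0.001)
    ```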


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r26540819
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -219,4 +265,39 @@ object GradientDescent extends Logging {
         (weights, stochasticLossHistory.toArray)
     
       }
    +
    +  def runMiniBatchSGD(
    +      data: RDD[(Double, Vector)],
    +      gradient: Gradient,
    +      updater: Updater,
    +      stepSize: Double,
    +      numIterations: Int,
    +      regParam: Double,
    +      miniBatchFraction: Double,
    +      initialWeights: Vector): (Vector, Array[Double]) =
    +    GradientDescent.runMiniBatchSGD(data, gradient, updater, stepSize, numIterations,
    +                                    regParam, miniBatchFraction, initialWeights, 0.001)
    +
    +
    +  private def isConverged(previousWeights: Vector, currentWeights: Vector,
    +                          initialWeights: Vector, convergenceTol: Double): Boolean = {
    +    require(previousWeights != None)
    +    require(currentWeights != None)
    +    // To compare with convergence tolerance
    +    def solutionVecDiff(previousWeight: Vector,
    +                        currentWeight: Vector): Double = {
    +
    +      val lastWeight = currentWeight.toBreeze
    +      val lastBeforeWeight = previousWeight.toBreeze
    +      sum((lastBeforeWeight - lastWeight)
    --- End diff --
    
    I'd use a temp value to make sure the vector subtraction is only done once.
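
    A sketch of that change, assuming the breeze imports already used in the
    diff (sum and :*):

    ```scala
    def solutionVecDiff(previousWeight: Vector, currentWeight: Vector): Double = {
      // Compute the element-wise difference once and reuse it.
      val diff = previousWeight.toBreeze - currentWeight.toBreeze
      sum(diff :* diff) / diff.length
    }
    ```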


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-67069722
  
      [Test build #544 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/544/consoleFull) for   PR 3636 at commit [`f867eea`](https://github.com/apache/spark/commit/f867eea1735c39ccb939604f10817d8a6eb48b55).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r22151650
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -182,34 +206,43 @@ object GradientDescent extends Logging {
         var regVal = updater.compute(
           weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
     
    -    for (i <- 1 to numIterations) {
    -      val bcWeights = data.context.broadcast(weights)
    -      // Sample a subset (fraction miniBatchFraction) of the total data
    -      // compute and sum up the subgradients on this subset (this is one map-reduce)
    -      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    -        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    -          seqOp = (c, v) => {
    -            // c: (grad, loss, count), v: (label, features)
    -            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    -            (c._1, c._2 + l, c._3 + 1)
    -          },
    -          combOp = (c1, c2) => {
    -            // c: (grad, loss, count)
    -            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    -          })
    -
    -      if (miniBatchSize > 0) {
    -        /**
    -         * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    -         * and regVal is the regularization value computed in the previous iteration as well.
    -         */
    -        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    -        val update = updater.compute(
    -          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    -        weights = update._1
    -        regVal = update._2
    -      } else {
    -        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    +    val b = new Breaks
    +    b.breakable {
    --- End diff --
    
    @mengxr Thank you. I can modify it.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66178359
  
    @Lewuathe Thanks for the PR!  I added some inline comments.  One more general comment: when using subsampling (miniBatchFraction < 1.0), testing against a convergenceTolerance can be dangerous because of the stochasticity.  It would be good to add a check at the beginning of optimization for miniBatchFraction < 1.0 && convergenceTolerance > 0.0, and print a warning if that is the case.
    
    Let me know when I should make another pass over the PR.
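
    A sketch of such a guard at the top of runMiniBatchSGD (the message wording
    is illustrative):

    ```scala
    if (miniBatchFraction < 1.0 && convergenceTolerance > 0.0) {
      logWarning("Testing against a convergenceTolerance when using " +
        "miniBatchFraction < 1.0 can be unstable because of the stochasticity " +
        "in sampling.")
    }
    ```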


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21480898
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -77,6 +80,14 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       }
     
       /**
    +   * Set the convergence tolerance. Default 0.0
    --- End diff --
    
    It would be good to note what convergence tolerance is.  In particular, can you please note that it is compared with the change in the objective between consecutive iterations?
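
    One way that doc might read, as a sketch (wording is illustrative):

    ```scala
    /**
     * Set the convergence tolerance. Default 0.0.
     * The tolerance is compared with the change in the objective between
     * consecutive iterations; iteration stops once that change falls below it.
     */
    def setConvergenceTolerance(tolerance: Double): this.type = {
      this.convergenceTolerance = tolerance
      this
    }
    ```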


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66723145
  
    @Lewuathe Sorry---one more request.  Could you actually use "convergenceTol" instead of "convergenceTolerance" in order to fit with the public API in LBFGS?  Thanks!


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r26540817
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -219,4 +265,39 @@ object GradientDescent extends Logging {
         (weights, stochasticLossHistory.toArray)
     
       }
    +
    +  def runMiniBatchSGD(
    +      data: RDD[(Double, Vector)],
    +      gradient: Gradient,
    +      updater: Updater,
    +      stepSize: Double,
    +      numIterations: Int,
    +      regParam: Double,
    +      miniBatchFraction: Double,
    +      initialWeights: Vector): (Vector, Array[Double]) =
    +    GradientDescent.runMiniBatchSGD(data, gradient, updater, stepSize, numIterations,
    +                                    regParam, miniBatchFraction, initialWeights, 0.001)
    +
    +
    +  private def isConverged(previousWeights: Vector, currentWeights: Vector,
    +                          initialWeights: Vector, convergenceTol: Double): Boolean = {
    +    require(previousWeights != None)
    +    require(currentWeights != None)
    +    // To compare with convergence tolerance
    +    def solutionVecDiff(previousWeight: Vector,
    --- End diff --
    
    I would use a val instead of nested methods since solutionVecDiff is only used once (and nested methods are more confusing to read).
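
    A sketch of the val form, assuming the breeze sum and :* imports from the
    diff:

    ```scala
    // solutionVecDiff is used exactly once, so a val reads more directly
    // than a nested def.
    val solutionVecDiff: Double = {
      val diff = previousWeights.toBreeze - currentWeights.toBreeze
      sum(diff :* diff) / diff.length
    }
    ```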


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21724910
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -182,34 +203,46 @@ object GradientDescent extends Logging {
         var regVal = updater.compute(
           weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
     
    -    for (i <- 1 to numIterations) {
    -      val bcWeights = data.context.broadcast(weights)
    -      // Sample a subset (fraction miniBatchFraction) of the total data
    -      // compute and sum up the subgradients on this subset (this is one map-reduce)
    -      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    -        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    -          seqOp = (c, v) => {
    -            // c: (grad, loss, count), v: (label, features)
    -            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    -            (c._1, c._2 + l, c._3 + 1)
    -          },
    -          combOp = (c1, c2) => {
    -            // c: (grad, loss, count)
    -            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    -          })
    -
    -      if (miniBatchSize > 0) {
    -        /**
    -         * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    -         * and regVal is the regularization value computed in the previous iteration as well.
    -         */
    -        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    -        val update = updater.compute(
    -          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    -        weights = update._1
    -        regVal = update._2
    -      } else {
    -        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    +    val b = new Breaks
    +    b.breakable {
    +      for (i <- 1 to numIterations) {
    +        val bcWeights = data.context.broadcast(weights)
    +        // Sample a subset (fraction miniBatchFraction) of the total data
    +        // compute and sum up the subgradients on this subset (this is one map-reduce)
    +        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    +          .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    +            seqOp = (c, v) => {
    +              // c: (grad, loss, count), v: (label, features)
    +              val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    +              (c._1, c._2 + l, c._3 + 1)
    +            },
    +            combOp = (c1, c2) => {
    +              // c: (grad, loss, count)
    +              (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    +            })
    +
    +        if (miniBatchSize > 0) {
    +          /**
    +           * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    +           * and regVal is the regularization value computed in the previous iteration as well.
    +           */
    +          stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    +          val update = updater.compute(
    --- End diff --
    
    No need to split arguments across multiple lines; I would leave this as it was. It's OK to have a line wrap in the middle of the argument list, if needed to make each line < 100 chars.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66203442
  
    @jkbradley Thank you for reviewing. I'll update these points soon.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21556482
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -155,7 +169,13 @@ object GradientDescent extends Logging {
           numIterations: Int,
           regParam: Double,
           miniBatchFraction: Double,
    -      initialWeights: Vector): (Vector, Array[Double]) = {
    +      initialWeights: Vector,
    +      convergenceTolerance: Double): (Vector, Array[Double]) = {
    +
    +    // convergenceTolerance should be set with non minibatch settings
    +    if (miniBatchFraction < 1.0 && convergenceTolerance > 0.0) {
    +      logWarning("testing against a convergenceTolerance can be dangerous because of the stochasticity")
    --- End diff --
    
    Clarify:
    "Testing against a convergenceTolerance when using miniBatchFraction < 1.0 can be unstable because of the stochasticity in sampling."


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r22078041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -77,6 +80,17 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       }
     
       /**
    +   * Set the convergence tolerance. Default 0.001
    +   * convergenceTol is a condition which decides iteration termination.
    +   * If the difference between last loss and last before loss is less than convergenceTol
    --- End diff --
    
    It is not clear in the doc whether this is relative or absolute. Also, the diff of the loss is not a good measure for convergence, especially when the problem is ill-conditioned. The diff of the solution vector is better. Usually, a relative measure is used when the magnitude is greater than 1, and an absolute measure otherwise.
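
    A sketch of that rule applied to the solution vector, using Breeze's L2
    norm (illustrative, not part of the patch):

    ```scala
    import breeze.linalg.norm

    val previous = previousWeights.toBreeze.toDenseVector
    val current = currentWeights.toBreeze.toDenseVector
    val solutionDiff = norm(current - previous)
    val converged =
      if (norm(previous) > 1.0) {
        solutionDiff < convergenceTol * norm(previous) // relative measure
      } else {
        solutionDiff < convergenceTol // absolute measure
      }
    ```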


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21480864
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -27,6 +27,8 @@ import org.apache.spark.rdd.RDD
     import org.apache.spark.mllib.linalg.{Vectors, Vector}
     import org.apache.spark.mllib.rdd.RDDFunctions._
     
    +import scala.util.control.Breaks
    --- End diff --
    
    Please organize imports (Scala/Java, then non-Spark imports, then Spark)


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21724909
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala ---
    @@ -138,6 +138,45 @@ class GradientDescentSuite extends FunSuite with MLlibTestSparkContext with Matc
           "The different between newWeights with/without regularization " +
             "should be initialWeightsWithIntercept.")
       }
    +
    +  test("iteration should end with convergence tolerance") {
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val initialB = -1.0
    +    val initialWeights = Array(initialB)
    +
    +    val gradient = new LogisticGradient()
    +    val updater = new SimpleUpdater()
    +    val stepSize = 1.0
    +    val numIterations = 10
    +    val regParam = 0
    +    val miniBatchFrac = 1.0
    +    val convergenceTolerance = 5.0e-1
    +
    +    // Add a extra variable consisting of all 1.0's for the intercept.
    +    val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +    val data = testData.map { case LabeledPoint(label, features) =>
    +      label -> MLUtils.appendBias(features)
    +    }
    +
    +    val dataRDD = sc.parallelize(data, 2).cache()
    +    val initialWeightsWithIntercept = Vectors.dense(1.0 +: initialWeights.toArray)
    --- End diff --
    
    MLUtils.appendBias appends (to the end), so the extra 1.0 should probably go at the end for consistency.
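
    In other words, with MLUtils.appendBias putting the bias term last, the
    matching initial weights would also put the intercept last, e.g.:

    ```scala
    val initialWeightsWithIntercept = Vectors.dense(initialWeights.toArray :+ 1.0)
    ```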


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-67069849
  
      [Test build #544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/544/consoleFull) for   PR 3636 at commit [`f867eea`](https://github.com/apache/spark/commit/f867eea1735c39ccb939604f10817d8a6eb48b55).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-96755432
  
    I realized that, with convergence being measured based on the weight vector, it does not make sense to measure relative to the initial weights (since they could easily be specified as an all-zero or other arbitrary vector).  Instead, can you please measure relative to the L2 norm of the previous weight vector?
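
    A sketch of a check along those lines (the max-with-1.0 floor is an
    assumption that folds in the earlier relative/absolute suggestion):

    ```scala
    import breeze.linalg.norm

    private def isConverged(
        previousWeights: Vector,
        currentWeights: Vector,
        convergenceTol: Double): Boolean = {
      val previous = previousWeights.toBreeze.toDenseVector
      val current = currentWeights.toBreeze.toDenseVector
      // Relative to the previous weights' L2 norm, with a floor of 1.0 so the
      // test degrades to an absolute measure for small weight vectors.
      norm(current - previous) < convergenceTol * math.max(norm(previous), 1.0)
    }
    ```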


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21578119
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala ---
    @@ -138,6 +138,45 @@ class GradientDescentSuite extends FunSuite with MLlibTestSparkContext with Matc
           "The different between newWeights with/without regularization " +
             "should be initialWeightsWithIntercept.")
       }
    +
    +  test("iteration should end with convergence tolerance") {
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val initialB = -1.0
    +    val initialWeights = Array(initialB)
    +
    +    val gradient = new LogisticGradient()
    +    val updater = new SimpleUpdater()
    +    val stepSize = 1.0
    +    val numIterations = 10
    +    val regParam = 0
    +    val miniBatchFrac = 1.0
    +    val convergenceTolerance = 5.0e-1
    +
    +    // Add a extra variable consisting of all 1.0's for the intercept.
    --- End diff --
    
    Oh, I see.  Yes, please--thanks!  (It doesn't make the code any longer, does it?)


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66115272
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-96619311
  
    Can someone review this?


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r26540822
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -219,4 +265,39 @@ object GradientDescent extends Logging {
         (weights, stochasticLossHistory.toArray)
     
       }
    +
    +  def runMiniBatchSGD(
    +      data: RDD[(Double, Vector)],
    +      gradient: Gradient,
    +      updater: Updater,
    +      stepSize: Double,
    +      numIterations: Int,
    +      regParam: Double,
    +      miniBatchFraction: Double,
    +      initialWeights: Vector): (Vector, Array[Double]) =
    +    GradientDescent.runMiniBatchSGD(data, gradient, updater, stepSize, numIterations,
    +                                    regParam, miniBatchFraction, initialWeights, 0.001)
    +
    +
    +  private def isConverged(previousWeights: Vector, currentWeights: Vector,
    +                          initialWeights: Vector, convergenceTol: Double): Boolean = {
    +    require(previousWeights != None)
    +    require(currentWeights != None)
    +    // To compare with convergence tolerance
    +    def solutionVecDiff(previousWeight: Vector,
    +                        currentWeight: Vector): Double = {
    +
    +      val lastWeight = currentWeight.toBreeze
    +      val lastBeforeWeight = previousWeight.toBreeze
    +      sum((lastBeforeWeight - lastWeight)
    +        :* (lastBeforeWeight - lastWeight)) / lastWeight.length
    +    }
    +
    +    def squareAvg(weights: Vector): Double =
    +      sum(weights.toBreeze :* weights.toBreeze) / weights.toBreeze.length
    +
    +    solutionVecDiff(previousWeights, currentWeights) <
    +      convergenceTol * squareAvg(initialWeights)
    --- End diff --
    
    I believe Breeze has norm() functions you can use instead.
    This norm should be precomputed and stored.
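
    A sketch of both points, assuming the baseline stays the initial weights as
    in the diff above (convergenceTol is captured from the enclosing scope):

    ```scala
    import breeze.linalg.norm

    // Precomputed once, before the optimization loop.
    val initialWeightsNorm = norm(initialWeights.toBreeze.toDenseVector)

    def isConverged(previousWeights: Vector, currentWeights: Vector): Boolean = {
      val diff = currentWeights.toBreeze.toDenseVector -
        previousWeights.toBreeze.toDenseVector
      norm(diff) < convergenceTol * initialWeightsNorm
    }
    ```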



[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21724908
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala ---
    @@ -138,6 +138,45 @@ class GradientDescentSuite extends FunSuite with MLlibTestSparkContext with Matc
           "The different between newWeights with/without regularization " +
             "should be initialWeightsWithIntercept.")
       }
    +
    +  test("iteration should end with convergence tolerance") {
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val initialB = -1.0
    +    val initialWeights = Array(initialB)
    +
    +    val gradient = new LogisticGradient()
    +    val updater = new SimpleUpdater()
    +    val stepSize = 1.0
    +    val numIterations = 10
    +    val regParam = 0
    +    val miniBatchFrac = 1.0
    +    val convergenceTolerance = 5.0e-1
    +
    +    // Add a extra variable consisting of all 1.0's for the intercept.
    +    val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
    +    val data = testData.map { case LabeledPoint(label, features) =>
    +      label -> MLUtils.appendBias(features)
    +    }
    +
    +    val dataRDD = sc.parallelize(data, 2).cache()
    +    val initialWeightsWithIntercept = Vectors.dense(1.0 +: initialWeights.toArray)
    +
    +    val (_, loss) = GradientDescent.runMiniBatchSGD(
    +      dataRDD,
    +      gradient,
    +      updater,
    +      stepSize,
    +      numIterations,
    +      regParam,
    +      miniBatchFrac,
    +      initialWeightsWithIntercept,
    +      convergenceTolerance)
    +
    +    assert(loss.length < numIterations, "convergenceTolerance failed to stop optimization early\"")
    --- End diff --
    
    No need for extra \" at end of comment.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66721864
  
    @Lewuathe Thanks for the updates!  I just added a few more comments, which should be the last.


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-80917580
  
    @jkbradley I think this patch was updated to use a relative convergence tolerance. Are there any other points we should fix about the convergence tolerance?


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21576963
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala ---
    @@ -138,6 +138,45 @@ class GradientDescentSuite extends FunSuite with MLlibTestSparkContext with Matc
           "The different between newWeights with/without regularization " +
             "should be initialWeightsWithIntercept.")
       }
    +
    +  test("iteration should end with convergence tolerance") {
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val initialB = -1.0
    +    val initialWeights = Array(initialB)
    +
    +    val gradient = new LogisticGradient()
    +    val updater = new SimpleUpdater()
    +    val stepSize = 1.0
    +    val numIterations = 10
    +    val regParam = 0
    +    val miniBatchFrac = 1.0
    +    val convergenceTolerance = 5.0e-1
    +
    +    // Add a extra variable consisting of all 1.0's for the intercept.
    --- End diff --
    
    I just copied the first test case ("Assert the loss is decreasing."). Should I also change the first case?


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21556490
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala ---
    @@ -138,6 +138,45 @@ class GradientDescentSuite extends FunSuite with MLlibTestSparkContext with Matc
           "The different between newWeights with/without regularization " +
             "should be initialWeightsWithIntercept.")
       }
    +
    +  test("iteration should end with convergence tolerance") {
    +    val nPoints = 10000
    +    val A = 2.0
    +    val B = -1.5
    +
    +    val initialB = -1.0
    +    val initialWeights = Array(initialB)
    +
    +    val gradient = new LogisticGradient()
    +    val updater = new SimpleUpdater()
    +    val stepSize = 1.0
    +    val numIterations = 10
    +    val regParam = 0
    +    val miniBatchFrac = 1.0
    +    val convergenceTolerance = 5.0e-1
    +
    +    // Add a extra variable consisting of all 1.0's for the intercept.
    --- End diff --
    
    I just realized there is already a method for this: MLUtils.appendBias
    Using that might be good since it's more human-readable.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-66345003
  
    @Lewuathe Thanks for the updates!  I just saw a couple more things, but I think it's almost ready.


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-82005205
  
    With this change, we should probably constrain convergenceTol to be in [0, 1].  Could you please add that to the doc & add a check in setConvergenceTol?
    
    Also, it should be made clear in the doc that the norm being used for measuring convergence is the L2 norm.
    
    Thanks!  I think the logic is good now, and just a few cleanups remain.
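
    A sketch of the requested validation and doc wording:

    ```scala
    /**
     * Set the convergence tolerance. Default 0.001. Must be in [0, 1].
     * Convergence is measured by the L2 norm of the change in the weight
     * vector between consecutive iterations.
     */
    def setConvergenceTol(tolerance: Double): this.type = {
      require(tolerance >= 0.0 && tolerance <= 1.0,
        s"Convergence tolerance must be in range [0, 1] but got $tolerance")
      this.convergenceTol = tolerance
      this
    }
    ```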


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r29171065
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -219,4 +268,30 @@ object GradientDescent extends Logging {
         (weights, stochasticLossHistory.toArray)
     
       }
    +
    +  def runMiniBatchSGD(
    +      data: RDD[(Double, Vector)],
    +      gradient: Gradient,
    +      updater: Updater,
    +      stepSize: Double,
    +      numIterations: Int,
    +      regParam: Double,
    +      miniBatchFraction: Double,
    +      initialWeights: Vector): (Vector, Array[Double]) =
    +    GradientDescent.runMiniBatchSGD(data, gradient, updater, stepSize, numIterations,
    +                                    regParam, miniBatchFraction, initialWeights, 0.001)
    +
    +
    +  private def isConverged(previousWeights: Vector, currentWeights: Vector,
    --- End diff --
    
    scala style: Since the method header takes more than 1 line, this should have 1 argument per line.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-78386542
  
    @Lewuathe mind placing [MLLIB] in the title so that this can get sorted properly?


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-81589501
  
    @jkbradley I'm sorry for the misunderstanding. I'll update the test to use that relative logic. Thank you.


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-81235479
  
    Oh, no, by "relative" convergence tolerance, we mean testing for:
    ```
    abs(previous_objective - current_objective) < convergenceTol * abs(initial_objective)
    ```
    (with the addition of the initial objective)
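    
    As a tiny sketch of that test in code (names are illustrative, not from the patch):
    
    ```scala
    // Relative convergence: the change in the objective is compared against
    // the tolerance scaled by the magnitude of the initial objective.
    def objectiveConverged(
        initialObjective: Double,
        previousObjective: Double,
        currentObjective: Double,
        convergenceTol: Double): Boolean = {
      math.abs(previousObjective - currentObjective) <
        convergenceTol * math.abs(initialObjective)
    }
    ```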


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r29171057
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -142,7 +157,10 @@ object GradientDescent extends Logging {
        * @param regParam - regularization parameter
        * @param miniBatchFraction - fraction of the input data set that should be used for
        *                            one iteration of SGD. Default value 1.0.
    -   *
    +   * @param convergenceTol - Minibatch iteration will end before numIterations if the
    --- End diff --
    
    There should not be a leading "-" for these parameters.  (I know you're just following previous work, but can you please remove them?)
    
    Also, please update doc based on using weight vector (not loss) and using relative change
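    
    For instance, the updated parameter doc could read roughly like this (a sketch, not the final wording):
    
    ```scala
     * @param convergenceTol Minibatch iteration will end before numIterations
     *                       if the relative change in the weight vector, measured
     *                       in the L2 norm, drops below this value. Default 0.001.
    ```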


[GitHub] spark pull request: [SPARK-3382][MLLIB] GradientDescent convergenc...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-80626659
  
    @Lewuathe  Was a decision made about @mengxr's suggestion of using relative convergence tolerance (instead of absolute, which is used by this PR currently)?  LBFGS (via Breeze) uses relative tolerance, so I'm pretty convinced we should use relative tolerance.
    
    (We need to document this for LBFGS too!  I just made a JIRA for that.)


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3636#issuecomment-67070947
  
    @Lewuathe  Once the scala style is fixed (dev/scalastyle), this should be ready.


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21556486
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -182,34 +202,40 @@ object GradientDescent extends Logging {
         var regVal = updater.compute(
           weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
     
    -    for (i <- 1 to numIterations) {
    -      val bcWeights = data.context.broadcast(weights)
    -      // Sample a subset (fraction miniBatchFraction) of the total data
    -      // compute and sum up the subgradients on this subset (this is one map-reduce)
    -      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    -        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    -          seqOp = (c, v) => {
    -            // c: (grad, loss, count), v: (label, features)
    -            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    -            (c._1, c._2 + l, c._3 + 1)
    -          },
    -          combOp = (c1, c2) => {
    -            // c: (grad, loss, count)
    -            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    -          })
    -
    -      if (miniBatchSize > 0) {
    -        /**
    -         * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    -         * and regVal is the regularization value computed in the previous iteration as well.
    -         */
    -        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    -        val update = updater.compute(
    -          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    -        weights = update._1
    -        regVal = update._2
    -      } else {
    -        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    +    val b = new Breaks
    +    b.breakable {
    +      for (i <- 1 to numIterations) {
    +        val bcWeights = data.context.broadcast(weights)
    +        // Sample a subset (fraction miniBatchFraction) of the total data
    +        // compute and sum up the subgradients on this subset (this is one map-reduce)
    +        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    +          .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    +            seqOp = (c, v) => {
    +              // c: (grad, loss, count), v: (label, features)
    +              val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    +              (c._1, c._2 + l, c._3 + 1)
    +            },
    +            combOp = (c1, c2) => {
    +              // c: (grad, loss, count)
    +              (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    +            })
    +
    +        if (miniBatchSize > 0) {
    +          /**
    +           * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    +           * and regVal is the regularization value computed in the previous iteration as well.
    +           */
    +          stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    +          val update = updater.compute(
    +            weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    +          weights = update._1
    +          regVal = update._2
    +          if (stochasticLossHistory.length > 1) {
    +            if (Math.abs(stochasticLossHistory.last - stochasticLossHistory(stochasticLossHistory.length - 2)) < convergenceTolerance) b.break
    --- End diff --
    
    scalastyle will complain (line length should be <= 100 chars)
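    
    One way to stay under the limit would be to split the check into named locals, e.g. (a sketch following the diff's variable names):
    
    ```scala
    if (stochasticLossHistory.length > 1) {
      val lastLoss = stochasticLossHistory.last
      val previousLoss = stochasticLossHistory(stochasticLossHistory.length - 2)
      if (Math.abs(lastLoss - previousLoss) < convergenceTolerance) {
        b.break
      }
    }
    ```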


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by Lewuathe <gi...@git.apache.org>.
Github user Lewuathe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r22151644
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -77,6 +80,17 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       }
     
       /**
    +   * Set the convergence tolerance. Default 0.001
    +   * convergenceTol is a condition which decides iteration termination.
    +   * If the difference between last loss and last before loss is less than convergenceTol
    --- End diff --
    
    @mengxr I'm sorry, but I don't know what the solution vector is in this context. What difference should I take to compare against the convergence tolerance?


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r22184450
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -77,6 +80,17 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       }
     
       /**
    +   * Set the convergence tolerance. Default 0.001
    +   * convergenceTol is a condition which decides iteration termination.
    +   * If the difference between last loss and last before loss is less than convergenceTol
    --- End diff --
    
    The "solution vector" is ```weights```.  (I agree with @mengxr on this.)  I'd vote for relative convergence tolerance such as:
    sum_i (oldWeights[i] - newWeights[i])^2 / numWeights
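    
    A sketch of that test on Breeze vectors (names are illustrative):
    
    ```scala
    import breeze.linalg.{DenseVector => BDV}
    
    // Mean squared change in the weights, compared against the tolerance.
    def weightsConverged(
        oldWeights: BDV[Double],
        newWeights: BDV[Double],
        convergenceTol: Double): Boolean = {
      val diff = oldWeights - newWeights
      (diff dot diff) / oldWeights.length < convergenceTol
    }
    ```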



[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21556478
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -142,7 +154,9 @@ object GradientDescent extends Logging {
        * @param regParam - regularization parameter
        * @param miniBatchFraction - fraction of the input data set that should be used for
        *                            one iteration of SGD. Default value 1.0.
    -   *
    +   * @param convergenceTolerance - Minibatch iteration will end within numIterations
    --- End diff --
    
    "will end within numIterations" --> "will end before numIterations" (clearer, IMO)


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r22078050
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -182,34 +206,43 @@ object GradientDescent extends Logging {
         var regVal = updater.compute(
           weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
     
    -    for (i <- 1 to numIterations) {
    -      val bcWeights = data.context.broadcast(weights)
    -      // Sample a subset (fraction miniBatchFraction) of the total data
    -      // compute and sum up the subgradients on this subset (this is one map-reduce)
    -      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    -        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    -          seqOp = (c, v) => {
    -            // c: (grad, loss, count), v: (label, features)
    -            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    -            (c._1, c._2 + l, c._3 + 1)
    -          },
    -          combOp = (c1, c2) => {
    -            // c: (grad, loss, count)
    -            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    -          })
    -
    -      if (miniBatchSize > 0) {
    -        /**
    -         * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
    -         * and regVal is the regularization value computed in the previous iteration as well.
    -         */
    -        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    -        val update = updater.compute(
    -          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
    -        weights = update._1
    -        regVal = update._2
    -      } else {
    -        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    +    val b = new Breaks
    +    b.breakable {
    --- End diff --
    
    Shall we use a while loop with a boolean flag? `breakable` creates a closure, a pattern we don't use elsewhere in the Spark code base.
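    
    The flag-based loop would look roughly like this (a sketch; `runStep` and `isConverged` are illustrative placeholders, not actual methods):
    
    ```scala
    var converged = false
    var i = 1
    while (!converged && i <= numIterations) {
      val previousWeights = weights
      weights = runStep(weights, i)  // one mini-batch gradient step
      converged = isConverged(previousWeights, weights, convergenceTol)
      i += 1
    }
    ```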


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r22078042
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -77,6 +80,17 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       }
     
       /**
    +   * Set the convergence tolerance. Default 0.001
    +   * convergenceTol is a condition which decides iteration termination.
    +   * If the difference between last loss and last before loss is less than convergenceTol
    +   * minibatch iteration will end at that point.
    +   */
    +  def setconvergenceTol(tolerance: Double): this.type = {
    --- End diff --
    
    `setConvergenceTol` (camelCase)


[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3636#discussion_r21724912
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -77,6 +80,14 @@ class GradientDescent private[mllib] (private var gradient: Gradient, private va
       }
     
       /**
    +   * Set the convergence tolerance. Default 0.001
    --- End diff --
    
    Please describe what convergence tolerance is here.

