Posted to reviews@spark.apache.org by dongwang218 <gi...@git.apache.org> on 2014/05/05 06:16:23 UTC

[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

GitHub user dongwang218 opened a pull request:

    https://github.com/apache/spark/pull/643

    [MLLIB] SPARK-1682: Add gradient descent w/o sampling and RDA L1 updater

    The GradientDescent optimizer samples the data before each gradient step. When the input data has already been shuffled, it is possible to scan the data sequentially and take a gradient step for each data instance. This is potentially more efficient.
    
    Add an enhanced RDA L1 updater, which can produce even sparser solutions with quality comparable to L1. Reference: 
    Lin Xiao, "Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization", Journal of Machine Learning Research 11 (2010) 2543-2596.
    
    Small fix: add options to BinaryClassification example to read and write model file
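
To make the "gradient step for each data instance" idea concrete, here is a minimal sketch in plain Scala collections rather than Spark RDDs. The names `PerInstanceSgd`, `Example`, and `onePass` are hypothetical and are not taken from the patch; the sketch assumes logistic loss, labels in {0, 1}, and data that the caller has already shuffled.

```scala
object PerInstanceSgd {
  // A sparse example: a label in {0, 1} plus (featureIndex, value) pairs.
  final case class Example(label: Double, features: Seq[(Int, Double)])

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One sequential pass over the data: one gradient step per example, no sampling.
  // Assumption: `data` was shuffled beforehand by the caller.
  def onePass(data: Seq[Example], numFeatures: Int, stepSize: Double): Array[Double] = {
    val w = Array.fill(numFeatures)(0.0)
    data.foreach { ex =>
      // Margin w . x, touching only the nonzero features.
      val margin = ex.features.map { case (i, v) => w(i) * v }.sum
      // Derivative of the logistic loss with respect to the margin.
      val err = sigmoid(margin) - ex.label
      ex.features.foreach { case (i, v) => w(i) -= stepSize * err * v }
    }
    w
  }
}
```

Each step costs time proportional to the number of nonzero features of one example, which is where the potential efficiency gain would come from when no regularization is applied.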

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongwang218/spark lr_svmlight

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/643.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #643
    
----
commit 50cdd69e7f8ebfa047a3b76efcc3ffb5e82b4cf7
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-01T00:36:28Z

    enable LogisticRegressionWithSGD to support svmlight data and gradient descent w/o sampling

commit 3131478826e1b943b2fd8fb02839d7b8df9b5377
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-01T18:54:23Z

    small fix for scalastyle

commit 5e6f5c43327aeea1978bde10f8621e156a9680f9
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-01T20:38:26Z

    Merge remote-tracking branch 'upstream/master' into lr_svmlight
    
    Conflicts:
    	mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

commit 96926db5488288bc6713d7be267e9adbe811b2f2
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-03T00:51:47Z

    add enhanced l1-RDA

commit 76c4d600b35becf124f02e3f0ed3ff9d9ae67a18
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-05T03:29:38Z

    add more options to BinaryClassification example

commit 87f96a269a5bebcaf45339007e9da57be51fa418
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-05T03:49:10Z

    Merge remote-tracking branch 'upstream/master' into lr_svmlight

commit 391d4bce492ef908fd6a21467e895368a85c2f10
Author: Dong Wang <dw...@gmail.com>
Date:   2014-05-05T04:03:34Z

    small fix: break long line

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-47963757
  
     Merged build triggered. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46591810
  
     Merged build triggered. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46591829
  
    Merged build started. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12463354
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -90,55 +135,104 @@ object BinaryClassification {
         }
       }
     
    +  def parseModel(strModel: Seq[(String, Double)]): (Vector, Double) = {
    +    val numFeatures = strModel(0)._2.toInt
    +    val intercept = strModel(1)._2
    +    val weights = Array.fill(numFeatures) { 0.0d }
    +    strModel.slice(2, strModel.length).foreach { kv =>
    +      weights(kv._1.toInt) = kv._2
    +    }
    +    (Vectors.dense(weights), intercept)
    +  }
    +
       def run(params: Params) {
    -    val conf = new SparkConf().setAppName(s"BinaryClassification with $params")
    +    val conf = new SparkConf().setMaster(params.master)
    +        .setAppName(s"BinaryClassification with $params")
         val sc = new SparkContext(conf)
     
         Logger.getRootLogger.setLevel(Level.WARN)
     
         val examples = MLUtils.loadLibSVMData(sc, params.input).cache()
     
    -    val splits = examples.randomSplit(Array(0.8, 0.2))
    -    val training = splits(0).cache()
    -    val test = splits(1).cache()
    +    val (training, test) = params.mode match {
    +      case TRAIN => (examples, sc.emptyRDD[LabeledPoint])
    +      case TEST => (sc.emptyRDD[LabeledPoint], examples)
    +      case SPLIT =>
    +        val splits = examples.randomSplit(Array(0.8, 0.2))
    +        val training = splits(0).cache()
    +        val test = splits(1).cache()
    +        examples.unpersist(blocking = false)
    --- End diff --
    
    training and test are cached before examples is unpersisted, right?



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42160779
  
    Jenkins, add to whitelist.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-45297247
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42637836
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14843/



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42636082
  
    Merged build started. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12463267
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -76,7 +76,7 @@ object MLUtils {
         } else {
           parsed.map { items =>
             if (items.length > 1) {
    -          items.last.split(':')(0).toInt
    +          items.tail.map { _.split(':')(0).toInt }.max
    --- End diff --
    
    Makes sense, reverted.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46527471
  
    @mengxr, per your comments on stochastic update performance:
    1. Training data are reordered. The effect on runtime is quite small.
    2. Regularizing the weights after each update is indeed expensive. To avoid this, LazySquaredL2Updater and LazyL1Updater are added. During regularization, both lazy updaters accumulate weightShrinkage and weightTruncation. These two are applied to the sparse data when gradients are computed, which is implemented in computeDotProduct.
    
    I measured runtime and quality on both the training and testing data of rcv1.binary. I will report the results of training on the testing data, as it is much bigger. --miniBatchFraction selects between batch, minibatch, and stochastic. All runs use local[5]; miniBatch uses 10% of the data.
    
    For L2 = 0.01 regularization
    | Method | numIterations | stepSize | AUC | PR-AUC | Real Time |
    | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
    | batch | 16 | 6.4 | 0.974 | 0.9771 | 1m 9s |
    | miniBatch | 16 | 6.4 | 0.974 | 0.977 | 1m 2s |
    | stochastic | 1 | 0.2 | 0.974 | 0.9773 | 1m 6s |
    
    For L1 = 0.001 regularization
    | Method | numIterations | stepSize | AUC | PR-AUC | Real Time | Nonzero Features |
    | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
    | batch | 16 | 6.4 | 0.944 | 0.950 | 53s | 201 |
    | miniBatch | 16 | 6.4 | 0.944 | 0.950 | 39s | 202 |
    | stochastic | 1 | 0.2 | 0.942 | 0.949 | 46s | 383 |
    
    Per-instance stochastic update has similar quality and performance compared with batch and minibatch. Note that batch and minibatch used a much larger stepSize (6.4 versus 0.2) in order to converge in a small number of iterations and be competitive. Stochastic update is also applicable to training on streaming data.
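
One common way to make per-example L2 regularization cheap is to keep the weights as a scalar factor times a stored vector, so that shrinking all weights is a single multiplication. The sketch below illustrates that general trick in plain Scala; it is not the PR's LazySquaredL2Updater, and the names `LazyL2Sketch`, `LazyWeights`, `shrinkage`, and `gradScale` are hypothetical. It assumes the gradient of one example is `gradScale * x` and that `eta * lambda < 1`.

```scala
object LazyL2Sketch {
  type Sparse = Seq[(Int, Double)]

  // Eager reference: shrinks every weight on every step, O(numFeatures) per example.
  def eagerStep(w: Array[Double], x: Sparse, gradScale: Double,
                eta: Double, lambda: Double): Unit = {
    var i = 0
    while (i < w.length) { w(i) *= (1.0 - eta * lambda); i += 1 }
    x.foreach { case (j, xv) => w(j) -= eta * gradScale * xv }
  }

  // Lazy variant: the true weights are `shrinkage * v`, so the L2 shrink is one
  // scalar multiply and only the nonzero features of x are touched.
  final class LazyWeights(n: Int) {
    private val v = Array.fill(n)(0.0)
    private var shrinkage = 1.0

    // Dot product against a sparse example, applying the accumulated shrinkage once.
    def dot(x: Sparse): Double = shrinkage * x.map { case (j, xv) => v(j) * xv }.sum

    def step(x: Sparse, gradScale: Double, eta: Double, lambda: Double): Unit = {
      shrinkage *= (1.0 - eta * lambda)
      // Divide by the factor so that shrinkage * v(j) receives exactly -eta * gradient.
      x.foreach { case (j, xv) => v(j) -= eta * gradScale * xv / shrinkage }
    }

    // Materialize the true weights.
    def toArray: Array[Double] = v.map(_ * shrinkage)
  }
}
```

Over many steps the scalar factor can become very small, so a practical version would presumably fold it back into the stored vector from time to time. An L1 truncation would need a different per-coordinate bookkeeping and is not covered by this sketch.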



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42160737
  
    ok to test



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42160891
  
    Merged build started. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42158182
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12463263
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala ---
    @@ -128,6 +128,45 @@ class L1Updater extends Updater {
     
     /**
      * :: DeveloperApi ::
    + * Updater for Enhanced L1-RDA regularized problems.
    + *          R(w) = ||w||_1
    + * Ignore the existing weights, but use average of gradient to compute new weights
    + * and apply L1 saturated thresholding. The enhanced version has `rho` which results
    + * in even sparse weights.
    --- End diff --
    
    this will be another pull request
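
For readers unfamiliar with the updater described in the doc comment above, the following is a rough sketch of the per-coordinate enhanced L1-RDA update as I understand it from the Xiao (2010) paper cited in the pull request description. It is not the code from this patch, and it should be checked against the paper before being relied on. The names `EnhancedL1Rda`, `weights`, `avgGrad`, and `gamma` are assumptions made for this illustration; `rho` is the sparsity-enhancing parameter mentioned in the doc comment.

```scala
object EnhancedL1Rda {
  // Computes new weights from the running average of gradients after t steps.
  // The previous weights are not used. A coordinate is set to zero when its
  // average gradient lies within the threshold lambda + gamma * rho / sqrt(t).
  def weights(avgGrad: Array[Double], t: Int, lambda: Double,
              gamma: Double, rho: Double): Array[Double] = {
    val sqrtT = math.sqrt(t.toDouble)
    val threshold = lambda + gamma * rho / sqrtT
    avgGrad.map { g =>
      if (math.abs(g) <= threshold) 0.0
      else -(sqrtT / gamma) * (g - threshold * math.signum(g))
    }
  }
}
```

With `rho = 0` this reduces to the plain thresholding at `lambda`; a larger `rho` raises the threshold, which is how the enhanced variant yields sparser weights.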



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-47969923
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-47963771
  
    Merged build started. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12333419
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
    @@ -40,6 +40,8 @@ class GradientDescent(private var gradient: Gradient, private var updater: Updat
       private var numIterations: Int = 100
       private var regParam: Double = 0.0
       private var miniBatchFraction: Double = 1.0
    +  private var stochastic: Boolean = true
    --- End diff --
    
    Having both `miniBatchFraction` and `stochastic` is a little confusing. I understand that you want to skip the sampling part. Can you do a check and skip sampling if `miniBatchFraction >= 1.0`?
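
The suggested check could look roughly like the sketch below, written against plain Scala collections instead of RDDs so that it runs on its own. `BatchSelection` and `selectBatch` are hypothetical names, and the Bernoulli filter is only a stand-in for whatever sampling the optimizer actually performs.

```scala
import scala.util.Random

object BatchSelection {
  // Returns the input untouched when the fraction covers the whole dataset;
  // otherwise keeps each element independently with probability miniBatchFraction.
  def selectBatch[T](data: Seq[T], miniBatchFraction: Double, seed: Long): Seq[T] =
    if (miniBatchFraction >= 1.0) {
      data // no sampling pass at all
    } else {
      val rng = new Random(seed)
      data.filter(_ => rng.nextDouble() < miniBatchFraction)
    }
}
```

This keeps a single knob, `miniBatchFraction`, and avoids the extra `stochastic` flag for the purpose of skipping the sampling pass.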



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42160844
  
    thanks @rxin!



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42160786
  
    @dongwang218 I don't think you can ask Jenkins to test, but I added you to the whitelist.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42693080
  
    @mengxr these two are valid concerns. For number 1, I will add a shuffle option to randomly reorder the instances before training. Do you know if there is an efficient way to do this in Spark? For number 2, I will run on rcv1.binary to compare with the existing code. If runtime is an issue, I could add lazy regularization. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12332953
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -42,19 +45,33 @@ object BinaryClassification {
     
       object RegType extends Enumeration {
         type RegType = Value
    -    val L1, L2 = Value
    +    val L1, L2, RDA = Value
    +  }
    +
    +  object Mode extends Enumeration {
    +    type Mode = Value
    +    val TRAIN, TEST, SPLIT = Value
       }
     
       import Algorithm._
       import RegType._
    +  import Mode._
     
       case class Params(
    +      master: String = null,
    --- End diff --
    
    With `spark-submit`, it is not necessary to set master in an app.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46528239
  
    Merged build finished. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46528240
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15900/



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12463226
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -42,19 +45,33 @@ object BinaryClassification {
     
       object RegType extends Enumeration {
         type RegType = Value
    -    val L1, L2 = Value
    +    val L1, L2, RDA = Value
    +  }
    +
    +  object Mode extends Enumeration {
    +    type Mode = Value
    +    val TRAIN, TEST, SPLIT = Value
       }
     
       import Algorithm._
       import RegType._
    +  import Mode._
     
       case class Params(
    +      master: String = null,
    --- End diff --
    
    removed



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42162173
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-45297249
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15490/



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42336312
  
    @mengxr thanks for the comments, I will split and update this PR without RDA related changes.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12466797
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -42,19 +45,33 @@ object BinaryClassification {
     
       object RegType extends Enumeration {
         type RegType = Value
    -    val L1, L2 = Value
    +    val L1, L2, RDA = Value
    +  }
    +
    +  object Mode extends Enumeration {
    +    type Mode = Value
    +    val TRAIN, TEST, SPLIT = Value
       }
     
       import Algorithm._
       import RegType._
    +  import Mode._
     
       case class Params(
    +      master: String = null,
    --- End diff --
    
    removed



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42162174
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14660/



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46527679
  
    Merged build started. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12333957
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -76,7 +76,7 @@ object MLUtils {
         } else {
           parsed.map { items =>
             if (items.length > 1) {
    -          items.last.split(':')(0).toInt
    +          items.tail.map { _.split(':')(0).toInt }.max
    --- End diff --
    
    LIBSVM format must have indices strictly increasing. So this is not necessary.
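
A small sketch of why the two expressions in the diff agree on well-formed input: when the indices on a line are strictly increasing, the last token carries the largest index, so scanning every token is redundant. `LibSvmIndex`, `lastIndex`, and `maxIndex` are hypothetical helper names, and returning 0 for a line with no features is an assumption, since the diff does not show that branch.

```scala
object LibSvmIndex {
  // Largest feature index on one line, read from the last token only.
  def lastIndex(line: String): Int = {
    val items = line.trim.split(' ')
    if (items.length > 1) items.last.split(':')(0).toInt else 0
  }

  // The same quantity computed by scanning every index on the line.
  def maxIndex(line: String): Int = {
    val items = line.trim.split(' ')
    if (items.length > 1) items.tail.map(_.split(':')(0).toInt).max else 0
  }
}
```

The two only differ on lines whose indices are out of order, which the comment above treats as invalid input.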


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-47969925
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16307/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42691596
  
    @dongwang218 There are two issues with stochastic updates:
    
    1. It depends on the ordering of the training examples. Users are not instructed to randomize the training data. In many cases, positives and negatives are generated in different ways and the training dataset is a simple union of them. Could you try ordering the labels before training and see how it affects the performance?
    2. We use sparse vectors to save on both storage and computation. If we apply the updater for every example, we lose the latter unless we apply no regularization. Could you try training `rcv1.binary` and see how it affects the running time?
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12463232
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -42,19 +45,33 @@ object BinaryClassification {
     
       object RegType extends Enumeration {
         type RegType = Value
    -    val L1, L2 = Value
    +    val L1, L2, RDA = Value
    +  }
    +
    +  object Mode extends Enumeration {
    +    type Mode = Value
    +    val TRAIN, TEST, SPLIT = Value
    --- End diff --
    
    `--test` is added and `--mode` is removed.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42326445
  
    @dongwang218 I think we had better split this PR into two. One would contain the changes to `examples/mllib/BinaryClassification.scala` minus RDA, which can be merged quickly. The other would contain the changes related to RDA, which we may need more time to review.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-45295387
  
    Merged build started. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46596550
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46527671
  
     Merged build triggered. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46596551
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15913/



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42636075
  
     Merged build triggered. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-45295286
  
     Merged build triggered. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42637835
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 closed the pull request at:

    https://github.com/apache/spark/pull/643



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by dongwang218 <gi...@git.apache.org>.
Github user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-62420402
  
    revisit later




[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42160885
  
     Merged build triggered. 



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12333105
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -42,19 +45,33 @@ object BinaryClassification {
     
       object RegType extends Enumeration {
         type RegType = Value
    -    val L1, L2 = Value
    +    val L1, L2, RDA = Value
    +  }
    +
    +  object Mode extends Enumeration {
    +    type Mode = Value
    +    val TRAIN, TEST, SPLIT = Value
    --- End diff --
    
    How about using an option `--test`? When it is given, use the data it points to for evaluation; otherwise, split the input data.
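
The suggested behaviour can be sketched in plain Scala as follows. This is an illustration under assumptions, not the code that ended up in the example: `trainingAndTest` is a hypothetical helper, in-memory sequences stand in for RDDs, and the 80/20 ratio is taken from the `randomSplit(Array(0.8, 0.2))` call in the diff.

```scala
import scala.util.Random

object TestOptionSketch {
  /**
   * Returns (training, test).
   * If held-out data was supplied (the proposed `--test` option), train on all
   * of `input` and evaluate on the held-out data. Otherwise split `input`
   * randomly, sending roughly 80% of it to training.
   */
  def trainingAndTest[A](
      input: Seq[A],
      test: Option[Seq[A]],
      seed: Long = 42L): (Seq[A], Seq[A]) =
    test match {
      case Some(heldOut) =>
        (input, heldOut)
      case None =>
        val rng = new Random(seed)
        // Copy to a strict Vector so the random predicate runs once per element.
        input.toVector.partition(_ => rng.nextDouble() < 0.8)
    }
}
```

Modelling the option as `Option[Seq[A]]` keeps a single code path for both cases and removes the need for a separate mode enumeration.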



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12333196
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala ---
    @@ -90,55 +135,104 @@ object BinaryClassification {
         }
       }
     
    +  def parseModel(strModel: Seq[(String, Double)]): (Vector, Double) = {
    +    val numFeatures = strModel(0)._2.toInt
    +    val intercept = strModel(1)._2
    +    val weights = Array.fill(numFeatures) { 0.0d }
    +    strModel.slice(2, strModel.length).foreach { kv =>
    +      weights(kv._1.toInt) = kv._2
    +    }
    +    (Vectors.dense(weights), intercept)
    +  }
    +
       def run(params: Params) {
    -    val conf = new SparkConf().setAppName(s"BinaryClassification with $params")
    +    val conf = new SparkConf().setMaster(params.master)
    +        .setAppName(s"BinaryClassification with $params")
         val sc = new SparkContext(conf)
     
         Logger.getRootLogger.setLevel(Level.WARN)
     
         val examples = MLUtils.loadLibSVMData(sc, params.input).cache()
     
    -    val splits = examples.randomSplit(Array(0.8, 0.2))
    -    val training = splits(0).cache()
    -    val test = splits(1).cache()
    +    val (training, test) = params.mode match {
    +      case TRAIN => (examples, sc.emptyRDD[LabeledPoint])
    +      case TEST => (sc.emptyRDD[LabeledPoint], examples)
    +      case SPLIT =>
    +        val splits = examples.randomSplit(Array(0.8, 0.2))
    +        val training = splits(0).cache()
    +        val test = splits(1).cache()
    +        examples.unpersist(blocking = false)
    --- End diff --
    
    You need to materialize `training` and `test` before removing `examples` from cache.
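
A sketch of one way to do this, assuming Spark is on the classpath. The wrapper object `SplitThenRelease` and its `split` method are hypothetical; the RDD calls (`randomSplit`, `cache`, `count`, `unpersist`) are the standard ones. `cache()` only marks an RDD for caching, so an action such as `count()` is needed to actually compute and store each split while the parent is still cached.

```scala
import org.apache.spark.rdd.RDD

object SplitThenRelease {
  /** Split a cached RDD 80/20 and release the parent only after both children are stored. */
  def split[T](examples: RDD[T]): (RDD[T], RDD[T]) = {
    val splits = examples.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1).cache()

    // Run an action on each split so that it is computed from the cached parent
    // and placed in the cache itself.
    val numTraining = training.count()
    val numTest = test.count()
    println(s"Training: $numTraining, test: $numTest")

    // Both children are now materialized, so the parent can be dropped safely.
    examples.unpersist(blocking = false)
    (training, test)
  }
}
```

If `examples` were unpersisted first, the first action on `training` or `test` would have to recompute the parent from its source, which defeats the purpose of caching it.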



[GitHub] spark pull request: [MLLIB] SPARK-1682: Add gradient descent w/o s...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/643#discussion_r12333653
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala ---
    @@ -128,6 +128,45 @@ class L1Updater extends Updater {
     
     /**
      * :: DeveloperApi ::
    + * Updater for Enhanced L1-RDA regularized problems.
    + *          R(w) = ||w||_1
    + * Ignore the existing weights, but use average of gradient to compute new weights
    + * and apply L1 saturated thresholding. The enhanced version has `rho` which results
    + * in even sparse weights.
    --- End diff --
    
    You need to explain the acronym `RDA` first and provide a reference.
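
For context, RDA stands for Regularized Dual Averaging, from the paper cited in the PR description (Lin Xiao, JMLR 11, 2010). The sketch below shows the enhanced L1-RDA weight computation as I read it from that paper; it is not the `Updater` code from this PR, and the object and parameter names are hypothetical. Given the running average of the gradients at iteration `t`, each weight is set to zero when the averaged gradient falls within a threshold, and is otherwise a scaled soft-thresholded value. The parameter `rho` raises the threshold, which is what yields sparser weights.

```scala
object EnhancedL1Rda {
  /**
   * avgGradient: running average of all gradients seen up to iteration t
   * t:           iteration count, starting at 1
   * lambda:      L1 regularization strength
   * gamma:       positive step-size scaling
   * rho:         non-negative sparsity-enhancing parameter (rho = 0 gives plain L1-RDA)
   */
  def weights(
      avgGradient: Array[Double],
      t: Int,
      lambda: Double,
      gamma: Double,
      rho: Double): Array[Double] = {
    require(t >= 1, "t must be at least 1")
    require(gamma > 0.0, "gamma must be positive")
    val sqrtT = math.sqrt(t.toDouble)
    // The effective threshold decays towards lambda as t grows.
    val threshold = lambda + gamma * rho / sqrtT
    avgGradient.map { g =>
      if (math.abs(g) <= threshold) 0.0
      else -(sqrtT / gamma) * (g - threshold * math.signum(g))
    }
  }
}
```

For averaged gradients (0.05, -0.5, 0.3) with `t = 4`, `lambda = 0.1` and `gamma = 1.0`, setting `rho = 0.0` gives weights of approximately (0, 0.8, -0.4), while `rho = 0.5` raises the threshold to 0.35 and gives approximately (0, 0.3, 0). Note that the existing weights are not an input: the new weights depend only on the averaged gradient.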

