You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MechCoder <gi...@git.apache.org> on 2015/04/02 11:41:19 UTC

[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

GitHub user MechCoder opened a pull request:

    https://github.com/apache/spark/pull/5330

    [SPARK-5972] [MLlib] Cache residuals and gradient in GBT during training and validation

    The previous PR https://github.com/apache/spark/pull/4906 helped to extract the learning curve giving the error for each iteration. This continues the work refactoring some code and extending the same logic during training and validation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark spark-5972

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5330
    
----
commit 100850ab4eb7cf74185a3959146e9af3f54d1ab3
Author: MechCoder <ma...@gmail.com>
Date:   2015-04-02T08:13:30Z

    [SPARK-5972] Cache residuals and gradient in GBT during training and validation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92209807
  
    Never mind, figured out, it should not take much effort, working on an update to this PR itself.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92150095
  
    Your explanation for why the previous code worked is correct.  It's just doing extra communication.
    
    I just noticed a few things when opening up this PR in IntelliJ (yay error highlighting).  That really should be it, though.
    
    I'm running a speed test to see if I can tell a difference between this and the previous code.  I'll post again later today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700329
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel, loss: Loss): RDD[(Double, Double)] = {
    +    data.map { i =>
    +      val pred = initTreeWeight * initTree.predict(i.features)
    +      val error = loss.computeError(pred, i.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Method to update a zipped predictionError RDD
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89251893
  
      [Test build #29662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29662/consoleFull) for   PR 5330 at commit [`5869533`](https://github.com/apache/spark/commit/586953361bfa4b020da29f1becd1da1537fc6bd2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700311
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
    @@ -39,17 +39,15 @@ object LogLoss extends Loss {
        * Method to calculate the loss gradients for the gradient boosting calculation for binary
        * classification
        * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
    -   * @param model Ensemble model
    -   * @param point Instance of the training dataset
    +   * @param prediction Predicted point
    +   * @param label True label.
        * @return Loss gradient
        */
    -  override def gradient(
    -      model: TreeEnsembleModel,
    -      point: LabeledPoint): Double = {
    -    val prediction = model.predict(point.features)
    -    - 4.0 * point.label / (1.0 + math.exp(2.0 * point.label * prediction))
    +  override def gradient(prediction: Double, label: Double): Double = {
    +    - 4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
       }
     
    +
    --- End diff --
    
    Remove extra newline


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88926360
  
    This unrelated test failure related to YARN keeps recurring for me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88849749
  
      [Test build #29605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29605/consoleFull) for   PR 5330 at commit [`100850a`](https://github.com/apache/spark/commit/100850ab4eb7cf74185a3959146e9af3f54d1ab3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28195992
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +150,68 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Compute the initial predictions and errors for a dataset for the first
    +   * iteration of gradient boosting.
    +   * @param Training data.
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel.
    +   * @param loss: evaluation metric.
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel,
    +      loss: Loss): RDD[(Double, Double)] = {
    +    data.map { lp =>
    +      val pred = initTreeWeight * initTree.predict(lp.features)
    +      val error = loss.computeError(pred, lp.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param training data.
    +   * @param predictionAndError: predictionError RDD
    +   * @param nTree: tree index.
    +   * @param treeWeight: Learning rate.
    +   * @param tree: Tree using which the prediction and error should be updated.
    +   * @param loss: evaluation metric.
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to each sample.
    +   */
    +  def updatePredictionError(
    +    data: RDD[LabeledPoint],
    +    predictionAndError: RDD[(Double, Double)],
    +    treeWeight: Double,
    +    tree: DecisionTreeModel,
    +    loss: Loss): RDD[(Double, Double)] = {
    +
    +    val sc = data.sparkContext
    +    val broadcastedTreeWeight = sc.broadcast(treeWeight)
    +    val broadcastedTree = sc.broadcast(tree)
    +
    +    val newPredError = data.zip(predictionAndError).mapPartitions { iter =>
    +      val currentTreeWeight = broadcastedTreeWeight.value
    +      val currentTree = broadcastedTree.value
    +      iter.map {
    +        case (lp, (pred, error)) => {
    +          val newPred = pred + currentTree.predict(lp.features) * currentTreeWeight
    +          val newError = loss.computeError(newPred, lp.label)
    +          (newPred, newError)
    +        }
    +      }
    +    }
    +
    +    broadcastedTreeWeight.unpersist()
    --- End diff --
    
    I realized that the broadcast variables can't be unpersisted here since they were used in a map for an RDD which has yet to be materialized.  (No action has been called on newPredError yet.)  We have 2 choices:
    * remove the unpersist, and let GC handle it once all of these RDDs go out of scope (at the end of training)
      * Let's do this one for now since it's simpler and since we aren't broadcasting (persisting) too much data.
    * return the broadcast variables, and keep them around until the caller can unpersist them safely
      * We can do this in the future if users ever encounter problems with too many broadcast variables.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92211060
  
      [Test build #30137 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30137/consoleFull) for   PR 5330 at commit [`d542bb0`](https://github.com/apache/spark/commit/d542bb009251786706720bd2e22dfaf6796b566d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92460453
  
    @MechCoder Thanks a lot for working through all of these tweaks with me!  The updates look good except for those 2 items


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92491539
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30185/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91893961
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30083/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92184482
  
    @jkbradley fixed! hopefully that should be it :P


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27743710
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -195,17 +195,24 @@ object GradientBoostedTrees extends Logging {
         baseLearners(0) = firstTreeModel
         baseLearnerWeights(0) = 1.0
         val startingModel = new GradientBoostedTreesModel(Regression, Array(firstTreeModel), Array(1.0))
    -    logDebug("error of gbt = " + loss.computeError(startingModel, input))
    +
    +    var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss)
    +    logDebug("error of gbt = " + predError.values.mean())
     
         // Note: A model of type regression is used since we require raw prediction
         timer.stop("building tree 0")
     
    -    var bestValidateError = if (validate) loss.computeError(startingModel, validationInput) else 0.0
    +    var validatePredError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(validationInput, 1.0, firstTreeModel, loss)
    +    var bestValidateError = if (validate) validatePredError.values.mean() else 0.0
         var bestM = 1
     
         // psuedo-residual for second iteration
    -    data = input.map(point => LabeledPoint(loss.gradient(startingModel, point),
    -      point.features))
    +    data = predError.zip(input).map {
    +      case ((pred, _), point) =>  LabeledPoint(loss.gradient(pred, point.label), point.features)
    --- End diff --
    
    You can check out Algorithm 1 in this paper: Friedman. "Stochastic Gradient Boosting." 1999.
    Intuitively, we want to minimize the loss, and using the gradient of the loss gives useful weights indicating how well/poorly we are doing at predicting each instance.  Since the gradient points in the direction of increasing loss, we want to negate it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92198905
  
    Thanks for the tests. I get the gist of what you mean.
    
    I'd be happy to merge this PR and work on this as a future JIRA. If I have any queries, I shall comment on that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700338
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel, loss: Loss): RDD[(Double, Double)] = {
    +    data.map { i =>
    +      val pred = initTreeWeight * initTree.predict(i.features)
    +      val error = loss.computeError(pred, i.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Method to update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param predictionAndError: predictionError RDD
    +   * @param currentTreeWeight: learning rate.
    +   * @param currentTree: first DecisionTree
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponing to each sample.
    +   */
    +  def updatePredictionError(
    +    data: RDD[LabeledPoint],
    +    predictionAndError: RDD[(Double, Double)],
    +    currentTreeWeight: Double,
    +    currentTree: DecisionTreeModel,
    +    loss: Loss): RDD[(Double, Double)] = {
    +
    +    data.zip(predictionAndError).mapPartitions { iter =>
    +      iter.map {
    +        case (point, (pred, error)) => {
    --- End diff --
    
    "point" -->  (Use the same variable name for a labeled point everywhere for consistency.  I tend to use "lp")


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91114211
  
    @jkbradley  I have fixed up your comment. So the idea is to broadcast each tree across all data, right? Sorry for the mess up. I got confused a bit, by your previous comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700335
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel, loss: Loss): RDD[(Double, Double)] = {
    +    data.map { i =>
    +      val pred = initTreeWeight * initTree.predict(i.features)
    +      val error = loss.computeError(pred, i.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Method to update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param predictionAndError: predictionError RDD
    +   * @param currentTreeWeight: learning rate.
    +   * @param currentTree: first DecisionTree
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponing to each sample.
    --- End diff --
    
    typo: "corresponding"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89006769
  
    @MechCoder Thanks!  I'll make a pass through this soon; at first glance, it looks good.  Have you tested this vs. the old implementation?  I'm wondering how big a difference there is, and also how big the problem has to be for that difference to be evident.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700327
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel, loss: Loss): RDD[(Double, Double)] = {
    +    data.map { i =>
    --- End diff --
    
    rename i -> "lp" or "labeledPoint"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92466609
  
      [Test build #30185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30185/consoleFull) for   PR 5330 at commit [`0b5d659`](https://github.com/apache/spark/commit/0b5d65900b22db5a0a9047fd2c3a1010d9d8c36b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91893953
  
      [Test build #30083 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30083/consoleFull) for   PR 5330 at commit [`58f4932`](https://github.com/apache/spark/commit/58f4932b7bdba52bdc9f28872687c6018ae86a8f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89054799
  
    @MechCoder  For this, I feel like local tests might be sufficient since they should show the speedup and since this isn't changing the communication that much.  My main worry is about RDD having long lineages; I made a JIRA today about that, but that can be addressed later on: [https://issues.apache.org/jira/browse/SPARK-6684]


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91608329
  
      [Test build #30032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30032/consoleFull) for   PR 5330 at commit [`70d3b4c`](https://github.com/apache/spark/commit/70d3b4c62fd0a60213a343d78a0b00d8e4b78a9a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91873006
  
      [Test build #30083 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30083/consoleFull) for   PR 5330 at commit [`58f4932`](https://github.com/apache/spark/commit/58f4932b7bdba52bdc9f28872687c6018ae86a8f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92481316
  
    LGTM once tests pass.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28267768
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +158,60 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Compute the initial predictions and errors for a dataset for the first
    +   * iteration of gradient boosting.
    +   * @param data: training data.
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel.
    +   * @param loss: evaluation metric.
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel,
    +      loss: Loss): RDD[(Double, Double)] = {
    +    data.map { lp =>
    +      val pred = initTreeWeight * initTree.predict(lp.features)
    +      val error = loss.computeError(pred, lp.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param data: training data.
    +   * @param predictionAndError: predictionError RDD
    +   * @param treeWeight: Learning rate.
    +   * @param tree: Tree using which the prediction and error should be updated.
    +   * @param loss: evaluation metric.
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to each sample.
    +   */
    +  def updatePredictionError(
    +    data: RDD[LabeledPoint],
    +    predictionAndError: RDD[(Double, Double)],
    +    treeWeight: Double,
    +    tree: DecisionTreeModel,
    +    loss: Loss): RDD[(Double, Double)] = {
    +
    +    val sc = data.sparkContext
    --- End diff --
    
    no longer needed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27717768
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -195,17 +195,24 @@ object GradientBoostedTrees extends Logging {
         baseLearners(0) = firstTreeModel
         baseLearnerWeights(0) = 1.0
         val startingModel = new GradientBoostedTreesModel(Regression, Array(firstTreeModel), Array(1.0))
    -    logDebug("error of gbt = " + loss.computeError(startingModel, input))
    +
    +    var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss)
    +    logDebug("error of gbt = " + predError.values.mean())
     
         // Note: A model of type regression is used since we require raw prediction
         timer.stop("building tree 0")
     
    -    var bestValidateError = if (validate) loss.computeError(startingModel, validationInput) else 0.0
    +    var validatePredError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(validationInput, 1.0, firstTreeModel, loss)
    +    var bestValidateError = if (validate) validatePredError.values.mean() else 0.0
         var bestM = 1
     
         // psuedo-residual for second iteration
    -    data = input.map(point => LabeledPoint(loss.gradient(startingModel, point),
    -      point.features))
    +    data = predError.zip(input).map {
    +      case ((pred, _), point) =>  LabeledPoint(loss.gradient(pred, point.label), point.features)
    --- End diff --
    
    Could you tell me why should there be a negative sign in any case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89025457
  
    I do not have access to a cluster as said before. It would be great if you had some old benchmarks. However it seems it should not matter a lot at least for small n_iterations. But I suppose it would be good to have it anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27767011
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -195,17 +195,24 @@ object GradientBoostedTrees extends Logging {
         baseLearners(0) = firstTreeModel
         baseLearnerWeights(0) = 1.0
         val startingModel = new GradientBoostedTreesModel(Regression, Array(firstTreeModel), Array(1.0))
    -    logDebug("error of gbt = " + loss.computeError(startingModel, input))
    +
    +    var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss)
    +    logDebug("error of gbt = " + predError.values.mean())
     
         // Note: A model of type regression is used since we require raw prediction
         timer.stop("building tree 0")
     
    -    var bestValidateError = if (validate) loss.computeError(startingModel, validationInput) else 0.0
    +    var validatePredError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(validationInput, 1.0, firstTreeModel, loss)
    +    var bestValidateError = if (validate) validatePredError.values.mean() else 0.0
         var bestM = 1
     
         // psuedo-residual for second iteration
    -    data = input.map(point => LabeledPoint(loss.gradient(startingModel, point),
    -      point.features))
    +    data = predError.zip(input).map {
    +      case ((pred, _), point) =>  LabeledPoint(loss.gradient(pred, point.label), point.features)
    --- End diff --
    
    Got it. Thanks a lot for the intuitive explanation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92184844
  
      [Test build #30137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30137/consoleFull) for   PR 5330 at commit [`d542bb0`](https://github.com/apache/spark/commit/d542bb009251786706720bd2e22dfaf6796b566d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92189854
  
    That is it, but I ran some timing tests locally and found essentially no difference between the two implementations, like you reported.  I think the issue is the overhead of broadcast variables.  I tried broadcasting the full arrays for evaluateEachIteration(), rather than each element separately, and it made evaluateEachIteration() take about 2/3 of the original time.  This was with depth-2 trees and 100 iterations of regression on a tiny test dataset ("abalone" from [libsvm's copy of the UCI dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html).
    
    Based on this bit of testing, I would guess the best solution will be to handle learning and evaluateEachIteration separately:
    * Learning: Do not broadcast trees or weights (but do use the caching you implemented here).
      * Communicating 1 tree should be a tiny cost compared to the cost of learning the tree.
    * evaluateEachIteration: Broadcast full tree array.  Compute errors for all iterations in a single map() call, and aggregate arrays of errors rather than individual errors.
      * Don't broadcast the weights array since it is small.
    
    I'm OK with merging this PR for now and making those items a future to-do.  But if you'd prefer to make these updates to this PR, that works too.
    
    Do you agree with this assessment?  What path would you prefer?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89251899
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29662/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700333
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel, loss: Loss): RDD[(Double, Double)] = {
    +    data.map { i =>
    +      val pred = initTreeWeight * initTree.predict(i.features)
    +      val error = loss.computeError(pred, i.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Method to update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700298
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -195,17 +195,24 @@ object GradientBoostedTrees extends Logging {
         baseLearners(0) = firstTreeModel
         baseLearnerWeights(0) = 1.0
         val startingModel = new GradientBoostedTreesModel(Regression, Array(firstTreeModel), Array(1.0))
    -    logDebug("error of gbt = " + loss.computeError(startingModel, input))
    +
    +    var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss)
    --- End diff --
    
    Use "baseLearnerWeights(0)" instead of "1.0"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92519859
  
    Merged into master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700302
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -195,17 +195,24 @@ object GradientBoostedTrees extends Logging {
         baseLearners(0) = firstTreeModel
         baseLearnerWeights(0) = 1.0
         val startingModel = new GradientBoostedTreesModel(Regression, Array(firstTreeModel), Array(1.0))
    -    logDebug("error of gbt = " + loss.computeError(startingModel, input))
    +
    +    var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss)
    +    logDebug("error of gbt = " + predError.values.mean())
     
         // Note: A model of type regression is used since we require raw prediction
         timer.stop("building tree 0")
     
    -    var bestValidateError = if (validate) loss.computeError(startingModel, validationInput) else 0.0
    +    var validatePredError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(validationInput, 1.0, firstTreeModel, loss)
    +    var bestValidateError = if (validate) validatePredError.values.mean() else 0.0
         var bestM = 1
     
         // psuedo-residual for second iteration
    --- End diff --
    
    "psuedo" --> "pseudo"  (I didn't notice before)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28208664
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +150,65 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Compute the initial predictions and errors for a dataset for the first
    +   * iteration of gradient boosting.
    +   * @param Training data.
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel.
    +   * @param loss: evaluation metric.
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel,
    +      loss: Loss): RDD[(Double, Double)] = {
    +    data.map { lp =>
    +      val pred = initTreeWeight * initTree.predict(lp.features)
    +      val error = loss.computeError(pred, lp.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param training data.
    +   * @param predictionAndError: predictionError RDD
    +   * @param nTree: tree index.
    --- End diff --
    
    "nTree" no longer exists


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91124464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29923/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700304
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -195,17 +195,24 @@ object GradientBoostedTrees extends Logging {
         baseLearners(0) = firstTreeModel
         baseLearnerWeights(0) = 1.0
         val startingModel = new GradientBoostedTreesModel(Regression, Array(firstTreeModel), Array(1.0))
    -    logDebug("error of gbt = " + loss.computeError(startingModel, input))
    +
    +    var predError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(input, 1.0, firstTreeModel, loss)
    +    logDebug("error of gbt = " + predError.values.mean())
     
         // Note: A model of type regression is used since we require raw prediction
         timer.stop("building tree 0")
     
    -    var bestValidateError = if (validate) loss.computeError(startingModel, validationInput) else 0.0
    +    var validatePredError: RDD[(Double, Double)] = GradientBoostedTreesModel.
    +      computeInitialPredictionAndError(validationInput, 1.0, firstTreeModel, loss)
    +    var bestValidateError = if (validate) validatePredError.values.mean() else 0.0
         var bestM = 1
     
         // psuedo-residual for second iteration
    -    data = input.map(point => LabeledPoint(loss.gradient(startingModel, point),
    -      point.features))
    +    data = predError.zip(input).map {
    +      case ((pred, _), point) =>  LabeledPoint(loss.gradient(pred, point.label), point.features)
    --- End diff --
    
    This looks like a bug---not from you, but from before.  Shouldn't there be a "-" in front of "loss.gradient" (as in the gradient computation in the loop below)?
    
    Also, style: Put "case ... =>" on the line above:
    ```
    data = predError.zip(input).map { case (...) =>
      LabeledPoint(...)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28208660
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -27,6 +27,7 @@ import org.json4s.jackson.JsonMethods._
     import org.apache.spark.{Logging, SparkContext}
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.broadcast.Broadcast
    --- End diff --
    
    no longer needed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88925643
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29609/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700322
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    --- End diff --
    
    Doc: "Training data"  The generated doc will already include the argument type.  Also a leftover problem from before...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91082537
  
    @MechCoder Apologies for the delay; I've been traveling.  Thanks for the updates!  The only issue I see now is with the use of broadcast variables; it's an issue with understanding how they work exactly.  Please read this: [https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables]
    (Actually, it's worth reading the whole programming guide.)
    
    They should be treated as read-only, so you should:
    * maintain baseLearners (and weights) as regular Arrays, not broadcast variables
    * create a new broadcast variable on each iteration for the new tree
    * never assign a value to a broadcast variable
    
    Let me know if you have questions about it.  (I can also push an update to the PR if you prefer.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89233785
  
      [Test build #29662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29662/consoleFull) for   PR 5330 at commit [`5869533`](https://github.com/apache/spark/commit/586953361bfa4b020da29f1becd1da1537fc6bd2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700316
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -143,17 +141,9 @@ class GradientBoostedTreesModel(
         val broadcastWeights = sc.broadcast(treeWeights)
     
         (1 until numIterations).map { nTree =>
    -      predictionAndError = remappedData.zip(predictionAndError).mapPartitions { iter =>
    -        val currentTree = broadcastTrees.value(nTree)
    -        val currentTreeWeight = broadcastWeights.value(nTree)
    -        iter.map {
    -          case (point, (pred, error)) => {
    -            val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
    -            val newError = loss.computeError(newPred, point.label)
    -            (newPred, newError)
    -          }
    -        }
    -      }
    +      predictionAndError = GradientBoostedTreesModel.updatePredictionError(
    +        remappedData, predictionAndError, broadcastWeights.value(nTree),
    --- End diff --
    
    You should pass the broadcast variables themselves; don't extract the value here since this happens before the mapPartitions().


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92231465
  
      [Test build #30142 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30142/consoleFull) for   PR 5330 at commit [`32d409d`](https://github.com/apache/spark/commit/32d409df86fe8300b58b1bd5db976bea7d36f36c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28208661
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -131,34 +132,17 @@ class GradientBoostedTreesModel(
         val numIterations = trees.length
         val evaluationArray = Array.fill(numIterations)(0.0)
     
    -    var predictionAndError: RDD[(Double, Double)] = remappedData.map { i =>
    -      val pred = treeWeights(0) * trees(0).predict(i.features)
    -      val error = loss.computeError(pred, i.label)
    -      (pred, error)
    -    }
    -    evaluationArray(0) = predictionAndError.values.mean()
    +    var predictionAndError = GradientBoostedTreesModel.computeInitialPredictionAndError(
    +      remappedData, treeWeights(0), trees(0), loss)
     
    -    // Avoid the model being copied across numIterations.
    -    val broadcastTrees = sc.broadcast(trees)
    -    val broadcastWeights = sc.broadcast(treeWeights)
    +    evaluationArray(0) = predictionAndError.values.mean()
     
         (1 until numIterations).map { nTree =>
    --- End diff --
    
    "map" --> "foreach"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700310
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
    @@ -39,17 +39,15 @@ object LogLoss extends Loss {
        * Method to calculate the loss gradients for the gradient boosting calculation for binary
        * classification
        * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
    -   * @param model Ensemble model
    -   * @param point Instance of the training dataset
    +   * @param prediction Predicted point
    --- End diff --
    
    "Predicted label" (Please correct elsewhere too)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89216224
  
    I've fixed up your comments. 
    
    It seems that the local tests seem to run in the same time, this may due to the fact that numIterations and data size are comparatively low, to take advantage of this.
    
    I can work on the other issue after this is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92464457
  
    @jkbradley fixed!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88864054
  
      [Test build #29605 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29605/consoleFull) for   PR 5330 at commit [`100850a`](https://github.com/apache/spark/commit/100850ab4eb7cf74185a3959146e9af3f54d1ab3).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91124457
  
      [Test build #29923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29923/consoleFull) for   PR 5330 at commit [`67b5311`](https://github.com/apache/spark/commit/67b531125204418c990103d497514f2a6335da03).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89230933
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29660/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91876121
  
    @jkbradley I've fixed up your comment!
    
    Thanks for the info. Just to clarify, does the previous code work because a copy of the broadcast variable in the driver node persists even after unpersisting and it broadcasts repeatedly for each action to `PredError` . Source (http://stackoverflow.com/a/24587558/1170730)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88848340
  
    ping @jkbradley


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91114090
  
      [Test build #29923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29923/consoleFull) for   PR 5330 at commit [`67b5311`](https://github.com/apache/spark/commit/67b531125204418c990103d497514f2a6335da03).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88864075
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29605/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700319
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    --- End diff --
    
    Doc: "Compute the initial predictions and errors for a dataset for the first iteration of gradient boosting."  No need to say it's a "method."  This is a leftover problem from before, so don't follow previous examples.  : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5330


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89230910
  
      [Test build #29660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29660/consoleFull) for   PR 5330 at commit [`c0869e7`](https://github.com/apache/spark/commit/c0869e759814ce2ac8fdc8d120e077ca5d828e7b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28208662
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +150,65 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Compute the initial predictions and errors for a dataset for the first
    +   * iteration of gradient boosting.
    +   * @param Training data.
    --- End diff --
    
    "Training data" --> "data: training data"  (need to specify parameter name "data")


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28208663
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +150,65 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Compute the initial predictions and errors for a dataset for the first
    +   * iteration of gradient boosting.
    +   * @param Training data.
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel.
    +   * @param loss: evaluation metric.
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel,
    +      loss: Loss): RDD[(Double, Double)] = {
    +    data.map { lp =>
    +      val pred = initTreeWeight * initTree.predict(lp.features)
    +      val error = loss.computeError(pred, lp.label)
    +      (pred, error)
    +    }
    +  }
    +
    +  /**
    +   * Update a zipped predictionError RDD
    +   * (as obtained with computeInitialPredictionAndError)
    +   * @param training data.
    --- End diff --
    
    "Training data" --> "data: training data"  (need to specify parameter name "data")


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700324
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -166,6 +156,56 @@ class GradientBoostedTreesModel(
     
     object GradientBoostedTreesModel extends Loader[GradientBoostedTreesModel] {
     
    +  /**
    +   * Method to compute initial error and prediction as a RDD for the first
    +   * iteration of gradient boosting.
    +   * @param data RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param initTreeWeight: learning rate assigned to the first tree.
    +   * @param initTree: first DecisionTreeModel
    +   * @param loss: evaluation metric
    +   * @return a RDD with each element being a zip of the prediction and error
    +   *         corresponding to every sample.
    +   */
    +  def computeInitialPredictionAndError(
    +      data: RDD[LabeledPoint],
    +      initTreeWeight: Double,
    +      initTree: DecisionTreeModel, loss: Loss): RDD[(Double, Double)] = {
    --- End diff --
    
    style: 1 parameter per line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-90824467
  
    @jkbradley Any news on this? Sorry if you are busy, but it has almost been a week ;)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91608342
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30032/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91871496
  
    After that one small change, I think this will be ready to merge.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r28267765
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
    @@ -131,34 +131,26 @@ class GradientBoostedTreesModel(
         val numIterations = trees.length
         val evaluationArray = Array.fill(numIterations)(0.0)
     
    -    var predictionAndError: RDD[(Double, Double)] = remappedData.map { i =>
    -      val pred = treeWeights(0) * trees(0).predict(i.features)
    -      val error = loss.computeError(pred, i.label)
    -      (pred, error)
    -    }
    +    var predictionAndError = GradientBoostedTreesModel.computeInitialPredictionAndError(
    +      remappedData, treeWeights(0), trees(0), loss)
    +
         evaluationArray(0) = predictionAndError.values.mean()
     
    -    // Avoid the model being copied across numIterations.
         val broadcastTrees = sc.broadcast(trees)
    -    val broadcastWeights = sc.broadcast(treeWeights)
    -
         (1 until numIterations).map { nTree =>
           predictionAndError = remappedData.zip(predictionAndError).mapPartitions { iter =>
             val currentTree = broadcastTrees.value(nTree)
    -        val currentTreeWeight = broadcastWeights.value(nTree)
    -        iter.map {
    -          case (point, (pred, error)) => {
    -            val newPred = pred + currentTree.predict(point.features) * currentTreeWeight
    -            val newError = loss.computeError(newPred, point.label)
    -            (newPred, newError)
    -          }
    +        val currentTreeWeight = treeWeights(nTree)
    --- End diff --
    
    We should make a local (shallow) copy of treeWeights before the map, within this method:
    ```
    val localTreeWeights = treeWeights
    ```
    Referencing treeWeights, a member of the class, will actually make the entire class get serialized by the ClosureCleaner.  Assigning it to a local val fixes that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-89213266
  
      [Test build #29660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29660/consoleFull) for   PR 5330 at commit [`c0869e7`](https://github.com/apache/spark/commit/c0869e759814ce2ac8fdc8d120e077ca5d828e7b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88892913
  
      [Test build #29609 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29609/consoleFull) for   PR 5330 at commit [`57cd906`](https://github.com/apache/spark/commit/57cd906912893e1bfde3b6f8d2556dedef3b5794).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92213928
  
      [Test build #30142 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30142/consoleFull) for   PR 5330 at commit [`32d409d`](https://github.com/apache/spark/commit/32d409df86fe8300b58b1bd5db976bea7d36f36c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-91584110
  
      [Test build #30032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30032/consoleFull) for   PR 5330 at commit [`70d3b4c`](https://github.com/apache/spark/commit/70d3b4c62fd0a60213a343d78a0b00d8e4b78a9a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92213809
  
    @jkbradley I pushed an update to the same PR. I agree with the observation, that it would have a much higher impact on evaluateEachIteration, because during training, prediction (and computing the residuals) is not really the bottleneck


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92231474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30142/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92491526
  
      [Test build #30185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30185/consoleFull) for   PR 5330 at commit [`0b5d659`](https://github.com/apache/spark/commit/0b5d65900b22db5a0a9047fd2c3a1010d9d8c36b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92211097
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30137/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5330#discussion_r27700307
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -242,8 +255,9 @@ object GradientBoostedTrees extends Logging {
             }
           }
           // Update data with pseudo-residuals
    -      data = input.map(point => LabeledPoint(-loss.gradient(partialModel, point),
    -        point.features))
    +      data = predError.zip(input).map {
    +        case ((pred, _), point) =>  LabeledPoint(-loss.gradient(pred, point.label), point.features)
    --- End diff --
    
    remove extra space after "=>"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5972] [MLlib] Cache residuals and gradi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-88925576
  
      [Test build #29609 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29609/consoleFull) for   PR 5330 at commit [`57cd906`](https://github.com/apache/spark/commit/57cd906912893e1bfde3b6f8d2556dedef3b5794).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org